Smart Business Intelligence Solutions with Microsoft SQL Server 2008


Foreword by Donald Farmer

Principal Program Manager, US-SQL Analysis Services


Microsoft Corporation

Smart Business
Intelligence Solutions
with Microsoft®
SQL Server® 2008

Lynn Langit
Kevin S. Goff, Davide Mauri, Sahil Malik, and John Welch
PUBLISHED BY
Microsoft Press
A Division of Microsoft Corporation
One Microsoft Way
Redmond, Washington 98052-6399
Copyright © 2009 by Kevin Goff and Lynn Langit
All rights reserved. No part of the contents of this book may be reproduced or transmitted in any form or by any means
without the written permission of the publisher.
Library of Congress Control Number: 2008940532

Printed and bound in the United States of America.

1 2 3 4 5 6 7 8 9 QWT 4 3 2 1 0 9

Distributed in Canada by H.B. Fenn and Company Ltd.

A CIP catalogue record for this book is available from the British Library.

Microsoft Press books are available through booksellers and distributors worldwide. For further information about
international editions, contact your local Microsoft Corporation office or contact Microsoft Press International directly at
fax (425) 936-7329. Visit our Web site at www.microsoft.com/mspress. Send comments to mspinput@microsoft.com.

Microsoft, Microsoft Press, Access, Active Directory, ActiveX, BizTalk, Excel, Hyper-V, IntelliSense, Microsoft Dynamics,
MS, MSDN, PerformancePoint, PivotChart, PivotTable, PowerPoint, ProClarity, SharePoint, Silverlight, SQL Server, Visio,
Visual Basic, Visual C#, Visual SourceSafe, Visual Studio, Win32, Windows, Windows PowerShell, Windows Server, and
Windows Vista are either registered trademarks or trademarks of the Microsoft group of companies. Other product and
company names mentioned herein may be the trademarks of their respective owners.

The example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events
depicted herein are fictitious. No association with any real company, organization, product, domain name, e-mail address,
logo, person, place, or event is intended or should be inferred.

This book expresses the author’s views and opinions. The information contained in this book is provided without any
express, statutory, or implied warranties. Neither the authors, Microsoft Corporation, nor its resellers or distributors will
be held liable for any damages caused or alleged to be caused either directly or indirectly by this book.

Acquisitions Editor: Ken Jones


Developmental Editor: Sally Stickney
Project Editor: Maureen Zimmerman
Editorial Production: Publishing.com
Technical Reviewer: John Welch; Technical Review services provided by Content Master, a member of CM Group, Ltd.
Cover: Tom Draper Design

Body Part No. X15-12284


For Mahnaz Javid and for her work
with the Mona Foundation, which she leads

—Lynn Langit, author


Contents at a Glance
Part I Business Intelligence for Business Decision Makers
and Architects
1 Business Intelligence Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Visualizing Business Intelligence Results . . . . . . . . . . . . . . . . . . . . 27
3 Building Effective Business Intelligence Processes . . . . . . . . . . . . 61
4 Physical Architecture in Business Intelligence Solutions . . . . . . . 85
5 Logical OLAP Design Concepts for Architects . . . . . . . . . . . . . . 115

Part II Microsoft SQL Server 2008 Analysis Services for Developers
6 Understanding SSAS in SSMS and SQL Server Profiler . . . . . . . 153
7 Designing OLAP Cubes Using BIDS . . . . . . . . . . . . . . . . . . . . . . . 183
8 Refining Cubes and Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . 225
9 Processing Cubes and Dimensions . . . . . . . . . . . . . . . . . . . . . . . . 257
10 Introduction to MDX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
11 Advanced MDX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
12 Understanding Data Mining Structures . . . . . . . . . . . . . . . . . . . . 355
13 Implementing Data Mining Structures . . . . . . . . . . . . . . . . . . . . 399

Part III Microsoft SQL Server 2008 Integration Services for Developers
14 Architectural Components of Microsoft SQL Server 2008
Integration Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
15 Creating Microsoft SQL Server 2008 Integration Services
Packages with Business Intelligence Development Studio . . . . 463
16 Advanced Features in Microsoft SQL Server 2008
Integration Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
17 Microsoft SQL Server 2008 Integration Services Packages
in Business Intelligence Solutions . . . . . . . . . . . . . . . . . . . . . . . . . 515


18 Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services . . . . . . . . . . . . . . . . . . . 539
19 Extending and Integrating SQL Server 2008
Integration Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567

Part IV Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence
20 Creating Reports in SQL Server 2008 Reporting Services . . . . . 603
21 Building Reports for SQL Server 2008 Reporting Services . . . . 627
22 Advanced SQL Server 2008 Reporting Services . . . . . . . . . . . . . 647
23 Using Microsoft Excel 2007 as an OLAP Cube Client . . . . . . . . 671
24 Microsoft Office 2007 as a Data Mining Client . . . . . . . . . . . . . 687
25 SQL Server Business Intelligence and Microsoft Office
SharePoint Server 2007 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
Table of Contents
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xix
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xxi
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xxiii

Part I Business Intelligence for Business Decision Makers and Architects
1 Business Intelligence Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Business Intelligence and Data Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
OLTP and OLAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Online Transactional Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Online Analytical Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Common BI Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Data Warehouses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Data Marts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Cubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Decision Support Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Data Mining Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Extract, Transform, and Load Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Report Processing Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Key Performance Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Core Components of a Microsoft BI Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
SQL Server 2008 Analysis Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
SQL Server 2008 Reporting Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
SQL Server 2008 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
SQL Server 2008 Integration Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Optional Components of a Microsoft BI Solution . . . . . . . . . . . . . . . . . . . . . . . . 21
Query Languages Used in BI Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
MDX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
DMX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


XMLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
RDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2 Visualizing Business Intelligence Results . . . . . . . . . . . . . . . . . . . . 27


Matching Business Cases to BI Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Top 10 BI Scoping Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Components of BI Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Understanding Business Intelligence from a User’s Perspective . . . . . . . . . . . . 34
Demonstrating the Power of BI Using Excel 2007 . . . . . . . . . . . . . . . . . . . 36
Understanding Data Mining via the Excel Add-ins . . . . . . . . . . . . . . . . . . 45
Viewing Data Mining Structures Using Excel 2007 . . . . . . . . . . . . . . . . . . 47
Elements of a Complete BI Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Reporting—Deciding Who Will Use the Solution . . . . . . . . . . . . . . . . . . . 51
ETL—Getting the Solution Implemented . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Data Mining—Don’t Leave It Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Common Business Challenges and BI Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Measuring the ROI of BI Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3 Building Effective Business Intelligence Processes . . . . . . . . . . . . 61


Software Development Life Cycle for BI Projects . . . . . . . . . . . . . . . . . . . . . . . . . 61
Microsoft Solutions Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Microsoft Solutions Framework for Agile Software Development . . . . . 63
Applying MSF to BI Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Phases and Deliverables in the Microsoft Solutions Framework . . . . . . . 65
Skills Necessary for BI Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Required Skills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Optional Skills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Forming Your Team . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Roles and Responsibilities Needed When Working with MSF . . . . . . . . . 76
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4 Physical Architecture in Business Intelligence Solutions . . . . . . . 85


Planning for Physical Infrastructure Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Creating Accurate Baseline Surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Assessing Current Service Level Agreements . . . . . . . . . . . . . . . . . . . . . . . 87
Determining the Optimal Number and Placement of Servers . . . . . . . . . . . . . . 89
Considerations for Physical Servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

Considerations for Logical Servers and Services . . . . . . . . . . . . . . . . . . . . 92


Understanding Security Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Security Requirements for BI Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Backup and Restore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Backing Up SSAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Backing Up SSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Backing Up SSRS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Auditing and Compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Auditing Features in SQL Server 2008 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Source Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

5 Logical OLAP Design Concepts for Architects . . . . . . . . . . . . . . 115


Designing Basic OLAP Cubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Star Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Denormalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Back to the Star . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Other Design Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Modeling Snowflake Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
More About Dimensional Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Understanding Fact (Measure) Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 146
Other Considerations in BI Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

Part II Microsoft SQL Server 2008 Analysis Services for Developers
6 Understanding SSAS in SSMS and SQL Server Profiler . . . . . . . 153
Core Tools in SQL Server Analysis Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Baseline Service Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
SSAS in SSMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
How Do You Query SSAS Objects? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Using MDX Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Using DMX Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Using XMLA Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Closing Thoughts on SSMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

7 Designing OLAP Cubes Using BIDS . . . . . . . . . . . . . . . . . . . . . . . 183


Using BIDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Offline and Online Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Working in Solution Explorer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Data Sources in Analysis Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Data Source Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Roles in Analysis Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Using Compiled Assemblies with Analysis Services Objects . . . . . . . . . . 196
Building OLAP Cubes in BIDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Examining the Sample Cube in Adventure Works . . . . . . . . . . . . . . . . . . 201
Understanding Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Attribute Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Attribute Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Translations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Using Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Measure Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Beyond Star Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Building Your First OLAP Cube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Selecting Measure Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Adding Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

8 Refining Cubes and Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . 225


Refining Your First OLAP Cube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Translations and Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Key Performance Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Calculations (MDX Scripts or Calculated Members) . . . . . . . . . . . . . . . . . 239
Using Cube and Dimension Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
Time Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
SCOPE Keyword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Account Intelligence and Unary Operator Definitions . . . . . . . . . . . . . . 246
Other Wizard Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Currency Conversions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Advanced Cube and Dimension Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

9 Processing Cubes and Dimensions . . . . . . . . . . . . . . . . . . . . . . . . 257


Building, Processing, and Deploying OLAP Cubes . . . . . . . . . . . . . . . . . . . . . . . 257
Differentiating Data and Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Working in a Disconnected Environment . . . . . . . . . . . . . . . . . . . . . . . . . 259
Working in a Connected Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Understanding Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Choosing Storage Modes: MOLAP, HOLAP, and ROLAP . . . . . . . . . . . . . 267
OLTP Table Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Other OLAP Partition Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Implementing Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Aggregation Design Wizard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
Usage-Based Optimization Wizard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
SQL Server Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Aggregation Designer: Advanced View . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Implementing Advanced Storage with MOLAP, HOLAP, or ROLAP . . . . . . . . 278
Proactive Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Notification Settings for Proactive Caching . . . . . . . . . . . . . . . . . . . . . . . 282
Fine-Tuning Proactive Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
ROLAP Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Writeback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Cube and Dimension Processing Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

10 Introduction to MDX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293


The Importance of MDX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Writing Your First MDX Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
MDX Object Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Other Elements of MDX Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
MDX Core Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Filtering MDX Result Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Calculated Members and Named Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Creating Objects by Using Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
The TopCount Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Rank Function and Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Head and Tail Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Hierarchical Functions in MDX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316

Date Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321


Using Aggregation with Date Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 324
About Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327

11 Advanced MDX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329


Querying Dimension Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Looking at Date Dimensions and MDX Seasonality . . . . . . . . . . . . . . . . . . . . . . 332
Creating Permanent Calculated Members . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Creating Permanent Calculated Members in BIDS . . . . . . . . . . . . . . . . . . 334
Creating Calculated Members Using MDX Scripts . . . . . . . . . . . . . . . . . . 335
Using IIf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
About Named Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
About Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
Understanding SOLVE_ORDER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Creating Key Performance Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Creating KPIs Programmatically . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Additional Tips on KPIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Using MDX with SSRS and PerformancePoint Server . . . . . . . . . . . . . . . . . . . . . 349
Using MDX with SSRS 2008 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
Using MDX with PerformancePoint Server 2007 . . . . . . . . . . . . . . . . . . . 352
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

12 Understanding Data Mining Structures . . . . . . . . . . . . . . . . . . . . 355


Reviewing Business Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Categories of Data Mining Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
Working in the BIDS Data Mining Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
Understanding Data Types and Content Types . . . . . . . . . . . . . . . . . . . . 361
Setting Advanced Data Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
Choosing a Data Mining Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Picking the Best Mining Model Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
Mining Accuracy Charts and Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Data Mining Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
Microsoft Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
Microsoft Decision Trees Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
Microsoft Linear Regression Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
Microsoft Time Series Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
Microsoft Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
Microsoft Sequence Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
Microsoft Association Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

Microsoft Neural Network Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394


Microsoft Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
The Art of Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396

13 Implementing Data Mining Structures . . . . . . . . . . . . . . . . . . . . 399


Implementing the CRISP-DM Life Cycle Model . . . . . . . . . . . . . . . . . . . . . . . . . . 399
Building Data Mining Structures Using BIDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Adding Data Mining Models Using BIDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
Processing Mining Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
Validating Mining Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
Lift Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
Profit Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
Classification Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Data Mining Prediction Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
DMX Prediction Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
DMX Prediction Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Data Mining and Integration Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
Data Mining Object Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
Data Mining Clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432

Part III Microsoft SQL Server 2008 Integration Services for Developers
14 Architectural Components of Microsoft SQL Server 2008
Integration Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
Overview of Integration Services Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 436
Integration Services Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
Tools and Utilities for Developing, Deploying, and Executing
Integration Services Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
The Integration Services Object Model and Components . . . . . . . . . . . . . . . . 442
Control Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .444
Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
Connection Managers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
Event Handlers and Error Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450

The Integration Services Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452


The Integration Services Data Flow Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
Data Flow Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
Synchronous Data Flow Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
Asynchronous Data Flow Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
Log Providers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
Deploying Integration Services Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
Package Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
Package Deployment Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462

15 Creating Microsoft SQL Server 2008 Integration Services Packages with Business Intelligence Development Studio . . . . . . . . . . . . 463
Integration Services in Visual Studio 2008 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
Creating New SSIS Projects with the Integration Services
Project Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
Viewing an SSIS Project in Solution Explorer . . . . . . . . . . . . . . . . . . . . . . 466
Using the SSIS Package Designers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
Working with the SSIS Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
Choosing from the SSIS Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
Connection Managers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Standard Database Connection Managers . . . . . . . . . . . . . . . . . . . . . . . . 473
Other Types of Connection Managers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
Control Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
Control Flow Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
Control Flow Containers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
Precedence Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
Data Flow Source Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
Destination Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
Transformation Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
Integration Services Data Viewers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
Variables Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
Variable Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
System Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
Variables and Default Values Within a Package . . . . . . . . . . . . . . . . . . . . 494
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495

16 Advanced Features in Microsoft SQL Server 2008 Integration Services . . . . . . . . . . . . . . . . . . . . . . . . 497
Error Handling in Integration Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
Events, Logs, Debugging, and Transactions in SSIS . . . . . . . . . . . . . . . . . . . . . . 499
Logging and Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
Debugging Integration Services Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Checkpoints and Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
Configuring Package Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
Best Practices for Designing Integration Services Packages . . . . . . . . . . . . . . . 509
Data Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514

17 Microsoft SQL Server 2008 Integration Services Packages in Business Intelligence Solutions . . . . . . . . . . . . . . . . . . . . 515
ETL for Business Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
Loading OLAP Cubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
Using Integration Services to Check Data Quality . . . . . . . . . . . . . . . . . . 516
Transforming Source Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
Using a Staging Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
Data Lineage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Moving to Star Schema Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Loading Dimension Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Loading Fact Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
Fact Table Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
Dimension Table Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
ETL for Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Initial Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
Data Mining Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538

18 Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services . . . . . . . . . . . . . . . . . . . 539
Solution and Project Structures in Integration Services . . . . . . . . . . . . . . . . . . 539
Source Code Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
Using Visual SourceSafe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
The Deployment Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
Package Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548

Copy File Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552


BIDS Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
Deployment with the Deployment Utility . . . . . . . . . . . . . . . . . . . . . . . . . 556
SQL Server Agent and Integration Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
Introduction to SSIS Package Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
Handling Sensitive Data and Proxy Execution Accounts . . . . . . . . . . . . . 563
Security: The Two Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
The SSIS Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565

19 Extending and Integrating SQL Server 2008 Integration Services . . . . . . . . . . . . . . . . . . . . . . . . 567
Introduction to SSIS Scripting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Visual Studio Tools for Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
The Script Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
The Dts Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Debugging Script Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
The Script Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
The ComponentMetaData Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
Source, Transformation, and Destination . . . . . . . . . . . . . . . . . . . . . . . . . . 582
Debugging Script Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
Overview of Custom SSIS Task and Component Development . . . . . . . . . . . . 587
Control Flow Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
Data Flow Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
Other Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
Overview of SSIS Integration in Custom Applications . . . . . . . . . . . . . . . . . . . . 596
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600

Part IV Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence
20 Creating Reports in SQL Server 2008 Reporting Services . . . . . 603
Understanding the Architecture of Reporting Services . . . . . . . . . . . . . . . . . . . 603
Installing and Configuring Reporting Services . . . . . . . . . . . . . . . . . . . . . . . . . . 606
HTTP Listener . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
Report Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
Report Server Web Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
Background Processing (Job Manager) . . . . . . . . . . . . . . . . . . . . . . . . . . . 612

Creating Reports with BIDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612


Other Types of Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Sample Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
Deploying Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625

21 Building Reports for SQL Server 2008 Reporting Services . . . . 627


Using the Query Designers for Analysis Services . . . . . . . . . . . . . . . . . . . . . . . . 627
MDX Query Designer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
Setting Parameters in Your Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
DMX Query Designer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
Working with the Report Designer in BIDS . . . . . . . . . . . . . . . . . . . . . . . . 635
Understanding Report Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
List and Rectangle Report Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Tablix Data Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Using Report Builder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646

22 Advanced SQL Server 2008 Reporting Services . . . . . . . . . . . . . 647


Adding Custom Code to SSRS Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
Viewing Reports in Word or Excel 2007 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
URL Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
Embedding Custom ReportViewer Controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
About Report Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
About Security Credentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
About the SOAP API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
What Happened to Report Models? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660
Deployment—Scalability and Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
Performance and Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
Advanced Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
Scaling Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666
Administrative Scripting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
Using WMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669

23 Using Microsoft Excel 2007 as an OLAP Cube Client . . . . . . . . 671


Using the Data Connection Wizard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671
Working with the Import Data Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
Understanding the PivotTable Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
Creating a Sample PivotTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678

Offline OLAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681


Excel OLAP Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
Extending Excel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685

24 Microsoft Office 2007 as a Data Mining Client . . . . . . . . . . . . . 687


Installing Data Mining Add-ins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
Data Mining Integration with Excel 2007 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 689
Using the Table Analysis Tools Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
Using the Data Mining Tab in Excel 2007 . . . . . . . . . . . . . . . . . . . . . . . . . 700
Data Mining Integration in Visio 2007 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
Client Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
Data Mining in the Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721

25 SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007 . . . . . . . . . . . . . . . . . . . . . . 723
Excel Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
Basic Architecture of Excel Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
Immutability of Excel Sheets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
Introductory Sample Excel Services Worksheet . . . . . . . . . . . . . . . . . . . . 726
Publishing Parameterized Excel Sheets . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
Excel Services: The Web Services API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
A Real-World Excel Services Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 733
SQL Server Reporting Services with Office SharePoint Server 2007 . . . . . . . . 736
Configuring SQL Server Reporting Services
with Office SharePoint Server 2007 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
Authoring and Deploying a Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
Using the Report in Office SharePoint Server 2007: Native Mode . . . . 740
Using the Report in Office SharePoint Server 2007:
SharePoint Integrated Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 742
Using the Report Center Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
PerformancePoint Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747

Foreword
When Lynn Langit’s name appears in my inbox or RSS feeds, I never know what to expect—
only that it will be interesting! She may be inviting me to share a technical webcast, passing
pithy comments about a conference speaker, or recalling the sight of swimming elephants
in Zambia, where Lynn tirelessly promotes information technology as a force for improving
health care. On this occasion, it was an invitation to write a foreword for this, her latest book,
Smart Business Intelligence Solutions with Microsoft SQL Server 2008. As so often, when Lynn
asks, the only possible response is, “Of course—I’d be happy to!”

When it comes to business intelligence, Lynn is a compulsive communicator. As a Developer
Evangelist at Microsoft, this is part of her job, but Lynn’s enthusiasm for the technologies and
their implications goes way beyond that. Her commitment is clear in her presentations and
webcasts, in her personal engagements with customers across continents, and in her writing.
Thinking of this, I am more than pleased to see this new book, especially to see that it tackles
the SQL Server business intelligence (BI) technologies in their broad scope.

Business intelligence is never about one technology solving one problem. In fact, a good BI
solution can address many problems at many levels—tactical, strategic, and even operational.
Part I, “Business Intelligence for Business Decision Makers and Architects,” explores these
business scenarios.

To solve these problems, you will find that your raw data is rarely sufficient. The BI developer
must apply business logic to enrich the data with analytical insights for business users.
Without this additional business logic, your system may only tell the users what they already
know. Part II, “Microsoft SQL Server 2008 Analysis Services for Developers,” takes a deep look
at using Analysis Services to create OLAP cubes and data mining models.

By their nature, these problems often require you to integrate data from across your business.
SQL Server 2008 Integration Services is the platform for this work, and in Part III,
“Microsoft SQL Server 2008 Integration Services for Developers,” Lynn tackles this technology.
She not only covers the details of building single workloads, but also sets this work in its
important architectural context, covering management and deployment of the integration
solutions.

Finally, in Part IV, “Microsoft SQL Server Reporting Services and Other Client Interfaces for
Business Intelligence,” there is a detailed exploration of the options for designing and publishing
reports. This section also covers other popular “clients”—the applications through
which business users interact with your BI solution. So, even if you are a Microsoft Office
Excel user, there is valuable information here.

When all of these elements—integration, analysis, and reporting—come together, you know
you are implementing a “smart solution,” the essence of this most helpful book.


I know from my own work at Microsoft, presenting and writing about BI, how difficult it is to
find good symmetry between technology and the business case. I also know how important
it is. Architects may build smart technology solutions, but enterprise decision makers put the
business into BI. For these readers, Lynn makes very few assumptions. She quickly, yet quite
thoroughly, takes the reader through a basic taxonomy of the moving parts of a BI solution.

However, this book is more than a basic introduction—it gets down to the details you
need to build effective solutions. Even experienced users will find useful insights and information
here. For example, all OLAP developers work with Analysis Services data source
views. However, many of them do not even know about the useful data preview feature. In
Chapter 7, “Designing OLAP Cubes Using BIDS,” Lynn not only describes the feature, but also
includes a good example of its use for simple validation and profiling. It is, for me, a good
measure of a book that it finds new things to say even about the most familiar features.

For scenarios that may be less familiar to you, such as data mining, Lynn carefully sets out the
business cases, the practical steps to take, and the traps to avoid. Having spent many hours
teaching and evangelizing about data mining myself, I really admire how Lynn navigates
through the subject. In one chapter, she starts from the highest level (“Why would I use data
mining?”) to the most detailed (“What is the CLUSTERING_METHOD parameter for?”), retaining
a pleasant and easy logical flow.

It is a privilege to work at Microsoft with Lynn. She clearly loves working with her customers
and the community. This book captures much of her enthusiasm and knowledge in print. You
will enjoy it, and I will not be surprised if you keep it close at hand on your desk whenever
you work with SQL Server 2008.

Donald Farmer
Principal Program Manager, US-SQL Analysis Services
Microsoft Corporation
Acknowledgments
Many people contributed to making this book. The authors would like to acknowledge those
people and the people who support them.

Lynn Langit
Thanks to all those who supported my efforts on this book.

First I’d like to thank my inspiration and the one who kept me going during the many months
of writing this book—Mahnaz Javid—and the work of her Mona Foundation. Please prioritize
caring for the world’s needy children and take the time to contribute to organizations that do
a good job with this important work. A portion of the proceeds of this book will be donated
to the Mona Foundation. For more information, go to http://www.monafoundation.org.

Thanks to my colleagues at Microsoft Press: Ken Jones, Sally Stickney, Maureen Zimmerman;
to my Microsoft colleagues: Woody Pewitt, Glen Gordon, Mithun Dhar, Bruno Terkaly,
Joey Snow, Greg Visscher, and Scott Kerfoot; and to the SQL Team: Donald Farmer, Francois
Ajenstadt, and Zack Owens.

Thanks to my co-writers and community reviewers: Davide Mauri, Sahil Malik, Kevin Goff,
Kim Schmidt, Mathew Roche, Ted Malone, and Karen Henderson.

Thanks especially to my technical reviewer, John Welch. John, I wish I hadn’t made you work
so hard!

Thanks to my friends and family for understanding the demands that writing makes on my
time and sanity: Lynn C, Teri, Chrys, Esther, Asli, Anton, and, most especially, to my mom and
my daughter.

Davide Mauri
A very big thanks to my wife Olga, who always supports me in everything I do; and to
Gianluca, Andrea, and Fernando, who allowed me to realize one of my many dreams!

Sahil Malik
Instead of an acknowledgment, I’d like to pray for peace, harmony, wisdom, and inner
happiness for everyone.

Introduction
So, why write? What is it that makes typing in a cramped airline seat on an 11-hour flight over
Africa so desirable? It’s probably because of a love of reading in general, and of learning in
particular. It’s not by chance that my professional blog at http://blogs.msdn.com/SoCalDevGal
is titled “Contagious Curiosity.” To understand why we wrote this particular book, you must
start with a look at the current landscape of business intelligence (BI) using Microsoft SQL
Server 2008.

Business intelligence itself really isn’t new. Business intelligence—or data warehousing, as it
has been traditionally called—has been used in particular industries, such as banking and
retailing, for many years. What is new is the accessibility of BI solutions to a broader audi-
ence. Microsoft is leading this widening of the BI market by providing a set of world-class
tools with SQL Server 2008. SQL Server 2008 includes the fourth generation of these tools
in the box (for no additional fee) and their capabilities are truly impressive. As customers
learn about the possibilities of BI, we see ever-greater critical mass adoption. We believe that
within the next few years, it will be standard practice to implement both OLTP and (BI) OLAP/
data mining solutions for nearly every installation of SQL Server 2008.

One of the most significant hindrances to previous adoption of BI technologies has not been
the quality of technologies and tools available in SQL Server 2008 or its predecessors. Rather,
what we have found (from our real-world experience) is that a general lack of understand-
ing of BI capabilities is preventing wider adoption. We find that developers, in particular, lack
understanding of BI core concepts such as OLAP (or dimensional) modeling and data mining
algorithms. This knowledge gap also includes lack of understanding about the capabilities of
the BI components and tools included with SQL Server 2008—SQL Server Analysis Services
(SSAS), SQL Server Integration Services (SSIS), and SQL Server Reporting Services (SSRS).

The gap is so significant, in fact, that it was one of the primary motivators for writing this
book. Far too many times, we’ve seen customers who lack understanding of core BI concepts
struggle to create BI solutions. Ironically, the BI tools included in SQL Server 2008 are in some
ways too easy to use. As with many Microsoft products, a right-click in the right place nearly
always starts a convenient wizard. So customers quickly succeed in building OLAP cubes and
data mining structures; unfortunately, sometimes they have no idea what they’ve actually
created. Often these solutions do not reveal their flawed underlying design until after they’ve
been deployed and are being run with production levels of data.

Because the SQL Server 2008 BI tools are designed to be intuitive, BI project implementation
is pleasantly simple, as long as what you build properly implements standard BI concepts. If
we’ve met our writing goals, you’ll have enough of both conceptual and procedural knowl-
edge after reading this book that you can successfully envision, design, develop, and deploy
a BI project built using SQL Server 2008.


Who This Book Is For


This book has more than one audience. The primary audience is professional developers
who want to start work on a BI project using SSAS, SSIS, and SSRS. Our approach is one of
inclusiveness—we have provided content targeted at both beginning and intermediate BI
developers. We have also included information for business decision makers who wish to
understand the capabilities of the technology and Microsoft’s associated tools. Because we
believe that appropriate architecture is the underpinning of all successful projects, we’ve also
included information for that audience.

We assume that our readers have production experience with a relational database. We also
assume that they understand relational database queries, tables, normalization and joins, and
other terms and concepts common to relational database implementations.

Although we’ve included some core information about administration of BI solutions, we
consider IT pros (or BI administrators) to be a secondary audience for this book.

What This Book Is About


This book starts by helping the reader develop an intuitive understanding of the complexity
and capabilities of BI as implemented using SQL Server 2008, and then it moves to a more
formal understanding of the concepts, architecture, and modeling. Next, it presents a more
process-focused discussion of the implementation of BI objects, such as OLAP cubes and
data mining structures, using the tools included in SQL Server 2008.

Unlike many other data warehousing books we’ve seen on the market, we’ve attempted to
attain an appropriate balance between theory and practical implementation. Another differ-
ence between our book and others is that we feel that data mining is a core part of a BI solu-
tion. Because of this we’ve interwoven information about data mining throughout the book
and have provided three chapters dedicated to its implementation.

Part I, “Business Intelligence for Business Decision Makers and Architects”
The goal of this part of the book is to answer these questions:

■■ Why use BI?
■■ What can BI do?
■■ How do I get started?

In this first part, we address the business case for BI. We also introduce BI tools, methods,
skills, and techniques. This section is written for developers, business decision makers, and
architects. Another way to look at our goal for this section is that we’ve tried to include all of
the information you’ll need to understand before you start developing BI solutions using SQL
Server Analysis Services in the Business Intelligence Development Studio (BIDS) toolset.

Chapter 1, “Business Intelligence Basics” In this chapter, we provide a practical definition
of exactly what BI is as implemented in SQL Server 2008. Here we define concepts such as
OLAP, dimensional modeling, and more. Also, we discuss tools and terms such as BIDS, MDX,
and more. Our aim is to provide you with a foundation for learning more advanced concepts.

Chapter 2, “Visualizing Business Intelligence Results” In this chapter, we look at BI from
an end user’s perspective using built-in BI client functionality in Microsoft Office Excel 2007.
Here we attempt to help you visualize the results of BI projects—namely, OLAP cubes and
data mining models.

Chapter 3, “Building Effective Business Intelligence Processes” In this chapter, we
examine software development life-cycle processes that we use when envisioning, design-
ing, developing, and deploying BI projects. Here we take a closer look at Microsoft Solutions
Framework (and other software development life cycles) as applied to BI projects.

Chapter 4, “Physical Architecture in Business Intelligence Solutions” In this chapter,
we examine best practices for establishing baselines in your intended production BI environ-
ment. We cover tools, such as SQL Server Profiler and more, that can help you prepare to
begin a BI project. We also talk about physical servers—especially, number and placement.
We include an introduction to security concepts. We close by discussing considerations for
setting up a BI development environment.

Chapter 5, “Logical OLAP Design Concepts for Architects” In this chapter, we take a
close look at core OLAP modeling concepts—namely, dimensional modeling. Here we take a
look at star schemas, fact tables, dimensional hierarchy modeling, and more.

Part II, “Microsoft SQL Server 2008 Analysis Services for Developers”
This part provides you with detailed information about how to use SSAS to build OLAP
cubes and data mining models. Most of this section is focused on using BIDS by working on
a detailed drill-down of all the features included. As we’ll do with each part of the book, the
initial chapters look at architecture and a simple implementation. Subsequent chapters are
where we drill into intermediate and, occasionally, advanced concepts.

Chapter 6, “Understanding SSAS in SSMS and SQL Server Profiler” In this chapter, we
look at OLAP cubes in SQL Server Management Studio and in SQL Server Profiler. We start
here because we want you to understand how to script, maintain, and move objects that
you’ve created for your BI solution. Also, SQL Server Profiler is a key tool to help you under-
stand underlying MDX or DMX queries from client applications to SSAS structures.

Chapter 7, “Designing OLAP Cubes Using BIDS” In this chapter, we begin the work of
developing an OLAP cube. Here we start working with BIDS, beginning with the sample SSAS
database Adventure Works 2008 DW.

Chapter 8, “Refining Cubes and Dimensions” In this chapter, we dig deeper into the
details of building OLAP cubes and dimensions using BIDS. Topics include dimensional hier-
archies, key performance indicators (KPIs), MDX calculations, and cube actions. We explore
both the cube and dimension designers in BIDS in great detail in this chapter.

Chapter 9, “Processing Cubes and Dimensions” In this chapter, we take a look at cube
metadata and data storage modes. Here we discuss multidimensional OLAP (MOLAP), hybrid
OLAP (HOLAP), and relational OLAP (ROLAP). We also look at the aggregation designer and
discuss aggregation strategies in general. We also examine proactive caching.

Chapter 10, “Introduction to MDX” In this chapter, we depart from using BIDS and pre-
sent a tutorial on querying by using MDX. We present core language features and teach via
many code examples in this chapter.

Chapter 11, “Advanced MDX” In this chapter, we move beyond core language features
to MDX queries to cover more advanced language features. We also take a look at how the
MDX language is used throughout the BI suite in SQL Server 2008—that is, in BIDS for SSAS
and SSRS.

Chapter 12, “Understanding Data Mining Structures” In this chapter, we take a look at
the data mining algorithms that are included in SSAS. We examine each algorithm in detail,
including presenting configurable properties, so that you can gain an understanding of what
is possible with SQL Server 2008 data mining.

Chapter 13, “Implementing Data Mining Structures” In this chapter, we focus on practi-
cal implementation of data mining models using SSAS in BIDS. We work through each tab
of the data mining model designer, following data mining implementation from planning to
development, testing, and deployment.

Part III, “Microsoft SQL Server 2008 Integration Services for Developers”
The goal of this part is to give you detailed information about how to use SSIS to develop
extract, transform, and load (ETL) packages. You’ll use these packages to load your OLAP
cubes and data mining structures. Again, we’ll focus on using BIDS while working on a
detailed drill-down of all the features included. As with each part of the book, the initial
chapters look at architecture and start with a simple implementation. Subsequent chapters
are where we drill into intermediate and, occasionally, advanced concepts.

Chapter 14, “Architectural Components of Microsoft SQL Server 2008 Integration
Services” In this chapter, we examine the architecture of SSIS. Here we take a look at the
data flow pipeline and more.

Chapter 15, “Creating Microsoft SQL Server 2008 Integration Services Packages with
Business Intelligence Development Studio” In this chapter, we explain the mechanics
of package creation using BIDS. Here we present the control flow tasks and then continue
by explaining data flow sources, destinations, and transformations. We continue work-
ing through the BIDS interface by covering variables, expressions, and the rest of the BIDS
interface.

Chapter 16, “Advanced Features in Microsoft SQL Server 2008 Integration
Services” In this chapter, we begin by taking a look at the error handling, logging, and
auditing features in SSIS. Next we look at some common techniques for assessing data qual-
ity, including using the new Data Profiling control flow task.

Chapter 17, “Microsoft SQL Server 2008 Integration Services Packages in Business
Intelligence Solutions” In this chapter, we take a look at extract, transform, and load pro-
cesses and best practices associated with SSIS when it’s used as a tool to create packages for
data warehouse loading. We look at this using both OLAP cubes and data mining models.

Chapter 18, “Deploying and Managing Solutions in Microsoft SQL Server 2008
Integration Services” In this chapter, we drill into the details of SSIS package deployment
and management. Here we look at using Visual SourceSafe (VSS) and other source control
solutions to manage distributed package deployment.

Chapter 19, “Extending and Integrating SQL Server 2008 Integration Services” In this
chapter, we provide an explanation about the details of extending the functionality of SSIS
packages using .NET-based scripts.

Part IV, “Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence”
The goal of this part is to give you detailed information about how to select and implement
client interfaces for OLAP cubes and data mining structures. We’ll look in great detail at SSRS.
In addition, we’ll examine using Excel, Visio, or Office SharePoint Server 2007 as your BI client
of choice. We’ll look at SSRS architecture, then at designing reports using BIDS and other
tools. Then we’ll move to a detailed look at implementing other clients, including a discussion
of the process for embedding results in a custom Windows Form or Web Form application.
As we do with each part of the book, our first chapters look at architecture, after which we
start with simple implementation. Subsequent chapters are where we drill into intermediate
and, occasionally, advanced concepts.

Chapter 20, “Creating Reports in SQL Server 2008 Reporting Services” In this chapter,
we present the architecture of SQL Server Reporting Services. We cover the various parts and
pieces that you’ll have to implement to make SSRS a part of your BI solution.

Chapter 21, “Building Reports for SQL Server 2008 Reporting Services” In this chap-
ter, we drill into the detail of building reports using BIDS. We take a look at the redesigned
interface and then look at the details of designing reports for OLAP cubes and data mining
models.

Chapter 22, “Advanced SQL Server 2008 Reporting Services” In this chapter, we look at
programmatically extending SSRS as well as other advanced uses of SSRS in a BI project. Here
we look at using the ReportViewer control in custom SSRS clients. We also take a look at the
new integration between SSRS and Excel and Word 2007.

Chapter 23, “Using Microsoft Excel 2007 as an OLAP Cube Client” In this chapter,
we walk through the capabilities included in Excel 2007 as an OLAP cube client. We take
a detailed look at the PivotTable functionality and also examine the PivotChart as a client
interface for OLAP cubes.

Chapter 24, “Microsoft Office 2007 as a Data Mining Client” In this chapter, we look at
using the SQL Server 2008 Data Mining Add-ins for Office 2007. These add-ins enable Excel
2007 to act as a client to SSAS data mining. We look at connecting to existing models on the
server as well as creating temporary models in the Excel session using Excel source data. We
also examine the new tools that appear on the Excel 2007 Ribbon after installing the add-ins.

Chapter 25, “SQL Server 2008 Business Intelligence and Microsoft Office SharePoint
Server 2007” In this chapter, we look at integration between SQL Server Reporting
Services and SharePoint technologies. We focus on integration between SSRS and Office
SharePoint Server 2007. Here we detail the integrated mode option for SSRS and
Office SharePoint Server 2007. We also look at the Report Center template included in Office
SharePoint Server 2007 and detail just how it integrates with SSRS. We have also included
information about Excel Services.

Prerelease Software
This book was written and tested against the release to manufacturing (RTM) 2008 version of
SQL Server Enterprise software. Microsoft released the final version of Microsoft SQL Server
2008 (build number 10.0.1600.22) in August 2008. We did review and test our examples
against the final release of the software. However, you might find minor differences between
the production release and the examples, text, and screen shots in this book. We made every
attempt to update all of the samples shown to reflect the RTM; however, minor variances in
screen shots or text between the community technology preview (CTP) samples and the RTM
samples might still remain.

Hardware and Software Requirements


You’ll need the following hardware and software to work with the information and examples
provided in this book:

■■ Microsoft Windows Server 2003 Standard edition or later. Microsoft Windows Server
2008 Enterprise is preferred. The Enterprise edition of the operating system is required
if you want to install the Enterprise edition of SQL Server 2008.
■■ Microsoft SQL Server 2008 Standard edition or later. Enterprise edition is required for
using all features discussed in this book. Installed components needed are SQL Server
Analysis Services, SQL Server Integration Services, and SQL Server Reporting Services.
■■ SQL Server 2008 Report Builder 2.0.
■■ Visual Studio 2008 (Team System is used to show the examples).
■■ Office SharePoint Server 2007 (Enterprise Edition or Windows SharePoint Services 3.0).
■■ Office 2007 Professional edition or better, including Excel 2007 and Visio 2007.
■■ SQL Server 2008 Data Mining Add-ins for Office 2007.
■■ 1.6 GHz Pentium III+ processor or faster.
■■ 1 GB of available, physical RAM.
■■ 10 GB of hard disk space for SQL Server and all samples.
■■ Video (800 by 600 or higher resolution) monitor with at least 256 colors.
■■ CD-ROM or DVD-ROM drive.
■■ Microsoft mouse or compatible pointing device.

Find Additional Content Online


As new or updated material becomes available that complements this book, it will be posted
online on the Microsoft Press Online Developer Tools Web site. The type of material you
might find includes updates to book content, articles, links to companion content, errata,
sample chapters, and more. This Web site is located at www.microsoft.com/learning/books/
online/developer and is updated periodically.

Lynn Langit is recording a companion screencast series named “How Do I BI?” Find this series
via her blog at http://blogs.msdn.com/SoCalDevGal.

Support for This Book


Every effort has been made to ensure the accuracy of this book. As corrections or changes
are collected, they will be added to a Microsoft Knowledge Base article.

Microsoft Press provides support for books at the following Web site:

http://www.microsoft.com/learning/support/books/

Questions and Comments


If you have comments, questions, or ideas regarding the book, or questions that are not
answered by visiting the site above, please send them to Microsoft Press via e-mail to

mspinput@microsoft.com

Or via postal mail to

Microsoft Press
Attn: Smart Business Intelligence Solutions with Microsoft SQL Server 2008 Editor
One Microsoft Way
Redmond, WA 98052-6399

Please note that Microsoft software product support is not offered through the above
addresses.
Part I
Business Intelligence for Business Decision Makers and Architects

Chapter 1
Business Intelligence Basics
Many real-world business intelligence (BI) implementations have been delayed or even
derailed because key decision makers involved in the projects lacked even a general under-
standing of the potential of the product stack. In this chapter, we provide you with a con-
ceptual foundation for understanding the broad potential of the BI technologies within
Microsoft SQL Server 2008 so that you won’t have to be in that position. We define some of
the basic terminology of business intelligence, including OLTP and OLAP, and go over the
components, both core and optional, of Microsoft BI solutions. We also introduce you to the
development languages involved in BI projects, including MDX, DMX, XMLA, and RDL.

If you already know these basic concepts, you can skip to Chapter 2, “Visualizing Business
Intelligence Results,” which talks about some of the common business problems that BI
addresses.

Business Intelligence and Data Modeling


You’ll see the term business intelligence defined in many different ways and in various con-
texts. Some vendors manufacture a definition that shows their tools in the best possible
light. You’ll sometimes hear BI summed up as “efficient reporting.” With the BI tools included
in SQL Server 2008, business intelligence is much more than an overhyped, supercharged
reporting system. For the purposes of this book, we define business intelligence in the same
way Microsoft does:

Business intelligence solutions include effective storage and presentation of key
enterprise data so that authorized users can quickly and easily access and interpret
it. The BI tools in SQL Server 2008 allow enterprises to manage their business at
a new level, whether to understand why a particular venture got the results it did,
to decide on courses of action based on past data, or to accurately forecast future
results on the basis of historical data.

You can customize the display of BI data so that it is appropriate for each type of user. For
example, analysts can drill into detailed data, executives can see timely high-level summaries,
and middle managers can request data presented at the level of detail they need to make
good day-to-day business decisions. Microsoft BI usually uses data structures (called cubes or
data mining structures) that are optimized to provide fast, easy-to-query decision support.
This BI data is presented to users via various types of reporting interfaces. These formats can
include custom applications for Microsoft Windows, the Web, or mobile devices as well as
Microsoft BI client tools, such as Microsoft Office Excel or SQL Server Reporting Services.


Figure 1-1 shows a conceptual view of a BI solution. In this figure, multiple types of source
data are consolidated into a centralized data storage facility. For a formal implementation of
a BI solution, the final destination container is most commonly called a cube. This consolida-
tion can be physical—that is, all the source data is physically combined onto one or more
servers—or logical, by using a type of a view. We consider BI conceptual modeling in more
detail in Chapter 5, “Logical OLAP Design Concepts for Architects.”

Figure 1-1 Business intelligence solutions present a consolidated view of enterprise data. This view can be a
physical or logical consolidation, or a combination of both. (The diagram shows clients with Access, Excel, and Word files, database servers holding relational data, mainframes, other servers, and Web services all feeding a cluster of BI servers.)

Although it’s possible to place all components of a BI solution on a single physical server, it’s
more typical to use multiple physical servers to implement a BI solution. Microsoft Windows
Server 2008 includes tremendous improvements in virtualization, so the number of physical
servers involved in a BI solution can be greatly reduced if you are running this version. We
talk more about physical modeling for BI solutions in Chapter 5.

Before we examine other common BI terms and components, let’s review two core concepts
in data modeling: OLTP and OLAP.

OLTP and OLAP


You’ve probably heard the terms OLTP and OLAP in the context of data storage. When plan-
ning SQL Server 2008 BI solutions, you need to have a solid understanding of these systems
as well as the implications of using them for your particular requirements.

Online Transactional Processing


OLTP stands for online transactional processing and is used to describe a relational data store
that is designed and optimized for transactional activities. Transactional activities are defined
as inserts, updates, and deletes to rows in tables. A typical design for this type of data stor-
age system is to create a large number of normalized tables in a single source database.

Relational vs. Nonrelational Data


SQL Server 2008 BI solutions support both relational and nonrelational source data.
Relational data usually originates from a relational database management system
(RDBMS) such as SQL Server 2008 (or an earlier version of SQL Server) or an RDBMS
built by a different vendor, such as Oracle or IBM. Relational databases generally consist
of a collection of related tables. They can also contain other objects, such as views or
stored procedures.

Nonrelational data can originate from a variety of sources, including Windows


Communication Foundation (WCF) or Web services, mainframes, and file-based appli-
cations, such as Microsoft Office Word or Excel. Nonrelational data can be presented
in many formats. Some of the more common formats are XML, TXT, CSV, and various
binary formats.

Normalization in relational data stores is usually implemented by creating a primary-key-to-


foreign-key relationship between the rows in one table (often called the parent table) and the
rows in another table (often called the child table). Typically (though not always), the rows in
the parent table have a one-to-many relationship with the rows in the child table. A common
example of this relationship is a Customer table and one or more related [Customer] Orders
tables. In the real world, examples are rarely this simple. Variations that include one-to-one
or many-to-many relationships, for example, are possible. These relationships often involve
many source tables.
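
To make this parent/child pattern concrete, the following Transact-SQL sketch shows a minimal normalized pair of tables. The table and column names are illustrative only and are not the actual AdventureWorks schema.

-- Hypothetical normalized parent/child pair (not the real AdventureWorks schema)
CREATE TABLE dbo.Customer (
    CustomerID   int IDENTITY(1,1) PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL
);

CREATE TABLE dbo.SalesOrderHeader (
    SalesOrderID int IDENTITY(1,1) PRIMARY KEY,
    CustomerID   int NOT NULL
        REFERENCES dbo.Customer (CustomerID),  -- foreign key back to the parent table
    OrderDate    datetime NOT NULL,
    TotalDue     money NOT NULL
);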

Figure 1-2 shows the many tables that can result when data stores modeled for OLTP are nor-
malized. The tables are related by keys.

Figure 1-2 Sample from AdventureWorks OLTP database

The primary reasons to model a data store in this way (that is, normalized) are to reduce the
total amount of data that needs to be stored and to improve the efficiency of performing
inserts, updates, and deletes by reducing the number of times the same data needs to be
added, changed, or removed. Extending the example in Figure 1-2, if you inserted a sec-
ond order for an existing customer and the customer’s information hadn’t changed, no new
information would have to be inserted into the Customer table; instead, only one or more
rows would have to be inserted into the related Orders tables, using the customer identifier
(usually a key value), to associate the order information with a particular customer. Although
this type of modeling is efficient for these activities (that is, inserting, updating, and deleting
data), the challenge occurs when you need to perform extensive reading of these types of
data stores.
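
Continuing the hypothetical tables from the previous sketch, a second order for an existing customer requires only an insert into the child table; the parent Customer row is reused through its key value.

-- CustomerID 42 is a made-up key value for a Customer row that already exists.
INSERT INTO dbo.SalesOrderHeader (CustomerID, OrderDate, TotalDue)
VALUES (42, '20090115', 199.00);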

To retrieve meaningful information from the list of Customers and the particular Order infor-
mation shown in Figure 1-2, you’d first have to select the rows meeting the report criteria
from multiple tables and then sort and match (or join) those rows to create the information
you need. Also, because a common business requirement is viewing aggregated information,
you might want to see the total sales dollar amount purchased for each customer, for exam-
ple. This requirement places additional load on the query processing engine of your OLTP
data store. In addition to selecting, fetching, sorting, and matching the rows, the engine also
has to aggregate the results.
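
As a rough illustration of that reporting workload, the following query (again written against the hypothetical tables from the earlier sketch) forces the engine to join, group, and aggregate rows at query time to produce a per-customer sales total.

-- Total sales per customer: rows must be selected, joined, grouped, and summed at query time.
SELECT   c.CustomerName,
         SUM(soh.TotalDue) AS TotalSales
FROM     dbo.Customer AS c
JOIN     dbo.SalesOrderHeader AS soh
         ON soh.CustomerID = c.CustomerID
GROUP BY c.CustomerName
ORDER BY TotalSales DESC;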

If the query you make involves only a few tables (for example, the Customer table and the
related SalesOrderHeader tables shown in Figure 1-2), and if these tables contain a small
number of rows, the overhead incurred probably would be minimal. (In this context, the
definition of a “small” number is relative to each implementation and is best determined
by baseline performance testing during the development phase of the project.) You might
be able to use a highly normalized OLTP data store to support both CRUD (create, retrieve,
update, delete) and read-only (decision support or reporting) activities. The processing
speed depends on your hardware resources and the configuration settings of your database
server. You also need to consider the number of users who need to access the information
simultaneously.

These days, the OLTP data stores you are querying often contain hundreds or even thou-
sands of source tables. The associated query processors must filter, sort, and aggregate mil-
lions of rows from the related tables. Your developers need to be fluent in the data store
query language so that they can write efficient queries against such a complex structure, and
they also need to know how to capture and translate every business requirement for report-
ing. You might need to take additional measures to improve query (and resulting report)
performance, including rewriting queries to use an optimal query syntax, analyzing the
query execution plan, providing hints to force certain execution paths, and adding indexes
to the relational source tables. Although these strategies can be effective, implementing
them requires significant skill and time on the part of your developers. Figure 1-3 shows an
example of a typical reporting query—not the complexity of the statement, but the number
of tables that can be involved.

Figure 1-3 Sample reporting query against a normalized data store
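
As one example of the index-based tuning just mentioned, an index such as the one below can cover the per-customer totals query sketched earlier. This is only a sketch against the hypothetical tables used in this chapter's examples; the right indexes for a real workload come from testing and from tools such as the Database Engine Tuning Advisor discussed next.

-- Covers the reporting query on the hypothetical child table: the foreign key is the key
-- column and the summed column is included, at the cost of extra storage and write overhead.
CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_CustomerID
    ON dbo.SalesOrderHeader (CustomerID)
    INCLUDE (TotalDue);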



SQL Server 2008 includes a powerful tool, the Database Engine Tuning Advisor, that can assist
you in manual tuning efforts, though the amount of time needed to implement and maintain
manual query tuning can become significant. Other costs are also involved with OLTP query
optimization, the most significant of which is the need for additional storage space and main-
tenance tasks as indexes are added. A good way to think of the move from OLTP alone to
a combined solution that includes both an OLTP store and an OLAP store is as a continuum
that goes from OLTP (single RDBMS) relational to relational copy to OLAP (cube) nonrela-
tional. In particular, if you’re already making a copy of your OLTP source data to create a
more efficient data structure from which to query for reporting and to reduce the load on
your production OLTP servers, you’re a prime candidate to move to a more formalized OLAP
solution based on the dedicated BI tools included in SQL Server 2008.

Online Analytical Processing


OLAP stands for online analytical processing and is used to describe a data structure that is
designed and optimized for analytical activities. Analytical activities are defined as those that
focus on the best use of data for the purpose of reading it rather than optimizing it so that
changes can be made in the most efficient way. In fact, many OLAP data stores are imple-
mented as read-only. Other common terms for data structures set up and optimized for
OLAP are decision support systems, reporting databases, data warehouses, or cubes.

As with OLTP, definitions of an OLAP data store vary depending on who you’re talking to. At
a minimum, most professionals would agree that an OLAP data store is modeled in a denor-
malized way. When denormalizing, you use very wide relational tables (those containing
many columns) with deliberately duplicated information. This approach reduces the number
of tables that must be joined to provide query results and lets you add indexes. Reducing the
size of surface area that is queried results in faster query execution. Data stores that are mod-
eled for OLAP are usually denormalized using a specific type of denormalization modeling
called a star schema. We cover this technique extensively in Chapter 4, “Physical Architecture
in Business Intelligence Solutions.”
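
The following Transact-SQL fragment sketches what a small piece of a star schema might look like: a fact table keyed to wide, deliberately denormalized dimension tables. The names are hypothetical and are not taken from AdventureWorksDW.

-- Hypothetical star-schema fragment: one fact table plus a wide dimension table.
CREATE TABLE dbo.DimCustomer (
    CustomerKey   int IDENTITY(1,1) PRIMARY KEY,
    CustomerName  nvarchar(100) NOT NULL,
    City          nvarchar(50) NULL,
    StateProvince nvarchar(50) NULL,
    Country       nvarchar(50) NULL    -- repeated on every customer row rather than normalized away
);

CREATE TABLE dbo.FactSales (
    DateKey       int NOT NULL,        -- surrogate key into a DimDate table (not shown)
    CustomerKey   int NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    ProductKey    int NOT NULL,        -- surrogate key into a DimProduct table (not shown)
    OrderQuantity int NOT NULL,
    SalesAmount   money NOT NULL
);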

Although using a denormalized relational store addresses some of the challenges encoun-
tered when trying to query an OLTP (or normalized) data store, a denormalized relational
store is still based on relational tables, so the bottlenecks are only partially mitigated.
Developers still must write complex queries for each business reporting requirement and
then manually denormalize the copy of the OLTP store, add indexes, and further tune que-
ries as performance demands. The work involved in performing these tasks can become
excessive.

Figure 1-4 shows a portion of a denormalized relational data store. We are working with the
AdventureWorksDW sample database, which is freely available for download and is designed
to help you understand database modeling for loading OLAP cubes. Notice the numerous
columns in each table.

Figure 1-4 Portion of AdventureWorksDW

Another way to implement OLAP is to use a cube rather than a group of tables. A cube is one
large store that holds all associated data in a single structure. The structure contains not only
source data but also pre-aggregated values. A cube is also called a multidimensional data
store.

Aggregation
Aggregation is the application of some type of mathematic function to data. Although
aggregation can be as simple as doing a SUM on numeric values, in BI solutions, it is
often much more complex. In most relational data stores, the number of aggregate
functions available is relatively small. For SQL Server 2008, for example, Transact-SQL
contains 12 aggregate functions: AVG, MIN, CHECKSUM_AGG, SUM, COUNT, STDEV,
COUNT_BIG, STDEVP, GROUPING, VAR, MAX, and VARP. Contrast this with the number
of built-in functions available in the SQL Server cube store: over 150. The number and
type of aggregate functions available in the cube store are more similar to those avail-
able in Excel than to those in SQL Server 2008.
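
For illustration, here is a simple Transact-SQL query that applies a few of the aggregate functions named above to the hypothetical FactSales table sketched earlier in this chapter.

-- Several Transact-SQL aggregates applied to the hypothetical FactSales table.
SELECT   CustomerKey,
         SUM(SalesAmount)   AS TotalSales,
         AVG(SalesAmount)   AS AverageSale,
         COUNT(*)           AS OrderLineCount,
         MAX(OrderQuantity) AS LargestQuantity
FROM     dbo.FactSales
GROUP BY CustomerKey;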

So, what exactly does a cube look like? It can be tricky to visualize a cube because most of us
can imagine only in three dimensions. Cubes are n-dimensional structures because they can
store data in an infinite number of dimensions. A conceptual rendering of a cube is shown in
Figure 1-5. Cubes contain two major aspects: facts and dimensions. Facts are often numeric
and additive, although that isn’t a requirement. Facts are sometimes called measures. An
example of a fact is “gross sales amount.” Dimensions give meaning to facts. For example,
you might need to be able to examine gross sales amount by time, product, customer, and
employee. All of the “by xxx” values are dimensions. Dimensional information is accessed
via a hierarchy of information along each dimensional axis. You’ll also hear the term Unified
Dimensional Model (UDM) to describe an OLAP cube because, in effect, such a cube “unifies”
a group of dimensions. We discuss this type of modeling in much greater detail in Chapter 5.

Figure 1-5 Sample cube structure (the diagram depicts a cube with three dimensional axes: Route, with Ground and Nonground members such as road, rail, sea, and air; Source, with Eastern and Western Hemisphere regions such as Africa, Asia, Australia, Europe, North America, and South America; and Time, with quarters rolling up to half-years. The Packages and Last measures are stored at each intersection.)

You might be wondering why a cube is preferable to a denormalized relational data store.
The answer is efficiency—in terms of scalability, performance, and ease of use. In the case of
using the business intelligence toolset in SQL Server 2008, you’ll also get a query process-
ing engine that is specifically optimized to deliver fast queries, particularly for queries that
involve aggregation. We’ll review the business case for using a dedicated OLAP solution in
more detail in Chapter 2. Also, we’ll look at real-world examples throughout the book.

The next phase of understanding an OLAP cube is translating that n-dimensional structure
to a two-dimensional screen so that you can visualize what users will see when working with
an OLAP cube. The standard viewer for a cube is a pivot table interface, a sample of which
is shown in Figure 1-6. The built-in viewers in the developer and administrative interfaces for
SQL Server 2008 OLAP cubes both use a type of pivot table to allow developers and admin-
istrators to visualize the cubes they are working with. Excel 2007 PivotTables are also a com-
mon user interface for SQL Server 2008 cubes.

Figure 1-6 SQL Server 2008 cube presented in a PivotTable

Common BI Terminology
A number of other conceptual terms are important to understand when you’re planning a BI
solution. In this section, we’ll talk about several of these: data warehouses; data marts; cubes;
decision support systems; data mining systems; extract, transform, and load systems; report
processing systems; and key performance indicators.

Data Warehouses
A data warehouse is a single structure that usually consists of one or more cubes. Figure 1-7
shows the various data sources that contribute to an OLAP cube. Data warehouses are used
to hold an aggregated, or rolled-up, and most commonly read-only view of the majority
of an organization’s data. A data warehouse environment also includes client query tools. When planning and
implementing your company’s data warehouse, you need to decide which data to include
and at what level of detail (or granularity). We explore this concept in more detail in “Extract,
Transform, and Load Systems” later in this chapter.

Figure 1-7 Conceptual OLAP cube (the diagram shows relational database sources such as SQL Server, Oracle, and DB2, WCF or Web services, mainframe data, file-based relational data such as Access and Excel, and semi-structured XML all feeding a single OLAP cube)

The terms OLAP and data warehousing are sometimes used interchangeably. However, this
is a bit of an oversimplification because an OLAP store is modeled as a cube or multidimen-
sionally, whereas a data warehouse can use either denormalized OLTP data or OLAP. OLAP
and data warehousing are not new technologies. The first Microsoft OLAP tools were part of
SQL Server 7. What is new in SQL Server 2008 is the inclusion of powerful tools that allow you
to implement a data warehouse using an OLAP cube (or cubes). Implementing BI solutions
built on OLAP is much easier because of improved tooling, performance, administration, and
usability, which reduces total cost of ownership (TCO).

Pioneers of Data Warehousing


Data warehousing has been available, usually implemented via specialized tools, since
the early 1980s. Two principal thought leaders of data warehousing theory are Ralph
Kimball and Bill Inmon. Both have written many articles and books and have popular
Web sites talking about their extensive experience with data warehousing solutions
using products from many different vendors.

To read more about Ralph Kimball’s ideas on data warehouse design modeling, go to
http://www.ralphkimball.com. I prefer the Kimball approach to modeling and have had
good success implementing Kimball’s methods in production BI projects. For a simple
explanation of the Kimball approach, see http://en.wikipedia.org/wiki/Ralph_Kimball.

Data Marts
A data mart is a defined subset of enterprise data, often a single cube from a group of cubes,
that is intended to be consolidated into a data warehouse. The single cube represents one
business unit (for example, marketing) from a greater whole (for example, the entire com-
pany). Data marts were the basic units of organization in the OLAP tools that were included
in earlier versions of SQL Server BI solutions because of restrictions in the tools themselves.
The majority of these restrictions were removed in SQL Server 2005. Because of this, data
warehouses built using the tools provided by SQL Server 2008 often consist of one huge
cube. This is not the case with many competitive OLAP products. There are, of course, excep-
tions to this single-cube design. However, limits in the product stack are not what determine
this type of design; rather, it is determined by OLAP modeler or developer preferences.

Cubes
As described earlier in the chapter, a BI cube is a data structure used by classic data ware-
housing products (including SQL Server 2008) in place of many relational tables. Rather
than containing tables with rows and columns, cubes consist of dimensions and measures
(or facts). Cubes can also contain data that is pre-aggregated (usually summed) rather than
included as individual items (or rows). In some cases, cubes contain a complete copy of pro-
duction data; in other cases, they contain subsets of source data.

In SQL Server 2008, cubes are more scalable and perform better than in previous versions
of SQL Server, so you can include data with much more detail than you could include when
using previous versions of the SQL Server OLAP tools, with many fewer adverse effects
on scalability and performance. As in previous versions, when you are using SQL Server 2008,
you will copy the source data from any number of disparate source systems to the destina-
tion OLAP cubes via extract, transform, and load (ETL) processes. (You’ll find out more about
ETL shortly.)

We talk a lot about cubes in this book, from their physical and logical design in Chapters
5 and 6 to the use of the cube-building tools that come with SQL Server 2008 in Part II,
“Microsoft SQL Server 2008 Analysis Services for Developers.”

Decision Support Systems


The term decision support system can mean anything from a read-only copy of an OLTP data
store to a group of OLAP cubes, or even a mixture of both. If the data source consists of only
an OLTP data store, then this type of store can be limited in its effectiveness because of the
challenges discussed earlier in this chapter, such as the difficulty of efficient querying and the
overhead required for indexes. Another way to think about a decision support system is some
type of data structure (such as a table or a cube) that is being used as a basis for developing
end-user reporting. End-user in this context means all types or categories of users. These
14 Part I Business Intelligence for Business Decision Makers and Architects

usually include business decision makers, middle managers, and general knowledge workers.
It is critical that your solution be able to provide data summarized at a level of detail that is
useful to these various end-user communities. The best BI solutions are intuitive for the vari-
ous end-user communities to work with—little or no end-user training is needed.

In this book, we focus on using the more efficient OLAP data store (or cube) as a source for a
decision support system.

Data Mining Systems


Data mining can be understood as a complementary technique to OLAP. Whereas OLAP is
used to provide decision support or the data to prove a particular hypothesis, data mining
is used in situations in which you have no solid hypothesis about the data. For example, you
could use an OLAP cube to verify that customers who purchased a certain product during
a certain timeframe had certain characteristics. Specifically, you could prove that customers
who purchased cars during December 2007 chose red-colored cars twice as often as they
picked black-colored cars if those customers shopped at locations in postal codes 90201 to
90207. You could use a data mining store to automatically correlate purchase factors into
buckets, or groups, so that decision makers could explore correlations and then form more
specific hypotheses based on their investigation. For example, they could decide to group or
cluster all customers segmented into “car purchasers” and “non-car-purchasers” categories.
They could further examine the clusters to find that “car purchasers” had the following traits
most closely correlated, in order of priority: home owners (versus non–home owners), mar-
ried (versus single), and so on.

Another scenario for which data mining is frequently used is one where your business
requirements include the need to predict one or more future target values in a dataset. An
example of this would be the rate of sale—that is, the number of items predicted to be sold
over a certain period of time.

We explore the data mining support included in SQL Server 2008 in greater detail in Chapter
6, “Understanding SSAS in SSMS and SQL Server Profiler,” in the context of logical modeling.
We also cover it in several subsequent chapters dedicated to implementing solutions that use
data mining.

Extract, Transform, and Load Systems


Commonly expressed as ETL, extract, transform, and load refers to a set of services that
facilitate the extraction, transformation, and loading of the various types of source data (for
example, relational, semi-structured, and unstructured) into OLAP cubes or data mining
structures. SQL Server 2008 includes a sophisticated set of tools to accomplish the ETL pro-
cesses associated with the initial loading of data into cubes as well as to process subsequent
incremental inserts of data into cubes, updates to data in cubes, and deletions of data from
cubes. ETL is explored in detail in Part III, “Microsoft SQL Server 2008 Integration Services for
Developers.” A common error made in BI solutions is underestimating the effort that will be
involved in the ETL processes for both the initial OLAP cube and the data mining structure
loads as well as the effort involved in ongoing maintenance, which mostly consists of insert-
ing new data but can also include updating and deleting data. It is not an exaggeration to say
that up to 75 percent of the project time for the initial work on a BI project can be attributed
to the ETL portion of the project. The “dirtiness,” complexity, and general incomprehensibility
of the data originating from various source systems are factors that are often overlooked in
the planning phase. By “dirtiness,” we mean issues such as invalid data. This can include data
of an incorrect type, format, length, and so on.
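
As a small illustration of the kind of data-quality probing that this ETL work involves, the following Transact-SQL sketch counts suspect rows in a hypothetical staging table before the data is allowed to move any further. The table and column names are invented for this example.

-- Data-quality probe against a hypothetical staging table before the cube load.
SELECT COUNT(*) AS SuspectRows
FROM   staging.SalesFeed
WHERE  ISDATE(OrderDateText) = 0      -- dates that will not convert
   OR  ISNUMERIC(AmountText) = 0      -- amounts that are not numeric
   OR  LEN(CustomerCode) > 10;        -- values too long for the target column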

Report Processing Systems


Most BI solutions use more than one type of reporting client because of the different needs
of the various users who need to interact with the cube data. An important part of plan-
ning any BI solution is to carefully consider all possible reporting tools. A common produc-
tion mistake is to under-represent the various user populations or to clump them together
when a more thorough segmentation would reveal very different reporting needs for each
population.

SQL Server 2008 includes a report processing system designed to support OLAP data
sources. In Part IV, “Microsoft SQL Server Reporting Services and Other Client Interfaces for
Business Intelligence,” we explore the included tools—such as SQL Server Reporting Services,
Office SharePoint Server 2007, and PerformancePoint Server—as well as other products that
are part of the Microsoft product suite for reporting.

Key Performance Indicators


Key performance indicators (KPIs) are generally used to indicate a goal that consists of sev-
eral values—actual, target, variance to target, and trend. Most often, KPIs are expressed and
displayed graphically—for example, as different colored traffic lights (red, yellow, or green).
KPIs usually include drill-down capabilities that allow interested decision makers to review the
data behind the KPI. KPIs can be implemented as part of an OLAP system, and they are often
part of reporting systems, which are most typically found on reporting dashboards or score-
cards. It is quite common for business requirements to include the latter as part of a central-
ized performance management strategy. Microsoft’s BI tools include the ability to create and
view KPIs. You can create KPIs from nearly any type of data source, such as OLAP cubes, Excel
workbooks, or SharePoint lists.
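
To make those values concrete, the following Transact-SQL sketch derives actual, target, variance, and a simple traffic-light status from a hypothetical goals table; the thresholds and names are invented, and a trend value would typically compare the actual against a prior period. In a production solution, KPIs like this are more commonly defined on the OLAP cube itself, as discussed in Chapter 8.

-- Actual, target, variance, and a traffic-light status from a hypothetical goals table.
SELECT Region,
       ActualSales,
       TargetSales,
       ActualSales - TargetSales AS Variance,
       CASE
           WHEN ActualSales >= TargetSales       THEN 'Green'
           WHEN ActualSales >= 0.9 * TargetSales THEN 'Yellow'
           ELSE 'Red'
       END AS Status
FROM   dbo.SalesGoals;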

Core Components of a Microsoft BI Solution


The core components of a Microsoft BI solution are SQL Server Analysis Services (SSAS), SQL
Server Reporting Services (SSRS), and SQL Server 2008 itself. SQL Server is often used as a
data source or an intermediate repository for data as it is being validated in preparation for
loading into an OLAP cube. The ETL toolset, called SQL Server Integration Services (SSIS),
also requires a SQL Server 2008 license. Each of these core services comes with its own set
of management interface tools. The SSAS development interface is the Business Intelligence
Development Studio (BIDS); SSIS and SSRS use the same development interface. The adminis-
trative interface all three use is SQL Server Management Studio (SSMS).

We typically use SQL Server Reporting Services in our solutions. Sometimes a tool other than
SSRS will be used as the report engine for data stores developed using SSAS. However, you
will typically use, at a minimum, the components just listed for a BI solution built using SQL
Server 2008. For example, you could purchase or build a custom application that doesn’t use
SSRS to produce reports. In that case, the application would include data queries as well as
calls to the SSAS structures via the application programming interfaces (APIs) included for it.

SQL Server 2008 Analysis Services


SSAS is the core service in a Microsoft BI solution. It provides the storage and query mecha-
nisms for the data used in OLAP cubes for the data warehouse. It also includes sophisticated
OLAP cube developer and administrative interfaces. SSAS is usually installed on at least one
dedicated physical server. You can install both SQL Server 2008 and SSAS on the same physi-
cal server, but this is done mostly in test environments. Figure 1-8 shows the primary tool
you’ll use to develop cubes for SSAS, the Business Intelligence Development Studio. BIDS
opens in a Microsoft Visual Studio environment. You don’t need a full Visual Studio installa-
tion to develop cubes for SSAS. If Visual Studio is not on your development machine, when
you install SSAS, BIDS installs as a stand-alone component. If Visual Studio is on your devel-
opment machine, BIDS installs as a component (really, a set of templates) in your existing
Visual Studio instance.

Figure 1-8 AdventureWorksDW in the OLAP cube view within BIDS

Note If you’re running a full version of Visual Studio 2008 on the same machine where you
intend to work with SSAS, you must install Service Pack 1 (SP1) for Visual Studio 2008.

AdventureWorksDW
AdventureWorksDW is the sample data and metadata that you can use while learn-
ing about the tools and capabilities of the SQL Server 2008 BI tools. We provide more
information about how to work with this sample later in this chapter. All screen shots
in this book show this sample being used. The samples include metadata and data
so that you can build OLAP cubes and mining structures, SSIS packages, and SSRS
reports. These samples are also available on Microsoft’s public, shared-source Web site:
CodePlex at http://codeplex.com/SqlServerSamples. Here you’ll find the specific locations
from which you can download these samples. Be sure to download samples that match
your version (for example, 2008 or 2005) and platform (x86 or x64). When running the
samples, be sure to use the sample for your edition of SQL Server.

Data Mining with Analysis Services 2008


SSAS also includes a component that allows you to create data mining structures that include
data mining models. Data mining models are objects that contain source data (either rela-
tional or multidimensional) that has been processed using one or more data mining algo-
rithms. These algorithms either classify (group) data or classify and predict one or more
column values. Although data mining has been available since SSAS 2000, its capabilities
have been significantly enhanced in the SQL Server 2008 release. Performance is improved,
and additional configuration capabilities are available. Figure 1-9 shows a data mining model
visualizer that comes with SQL Server 2008. Data mining visualizers are included in the
data mining development environment (BIDS), as well as in some client tools, such as Excel.
Chapter 12, “Understanding Data Mining Structures,” and Chapter 13, “Implementing Data
Mining Structures,” cover the data mining capabilities in SSAS in more detail.

Figure 1-9 Business Intelligence Development Studio (BIDS) data mining visualizer

SQL Server 2008 Reporting Services


Another key component in many BI solutions is SQL Server Reporting Services (SSRS). When
working with SQL Server 2008 to perform SSRS administrative tasks, you can use a variety of
included tools such as SSMS, a reporting Web site, or a command-line tool.

The enhancements made in SQL Server 2008 Reporting Services make it an attractive part of
any BI solution. The SSRS report designer in BIDS includes a visual query designer for SSAS
cubes, which facilitates rapid report creation by reducing the need to write manual queries
against OLAP cube data. SSRS includes another report-creation component, Report Builder,
which is intended to be used by analysts, rather than developers, to design reports. SSRS
also includes several client tools: a Web interface (illustrated in Figure 1-10), Web Parts for
Microsoft Office SharePoint Server, and client components for Windows Forms applications.

We discuss all flavors of reporting clients in Part IV.

Figure 1-10 SSRS reports can be displayed by using the default Web site interface.

SQL Server 2008


In addition to being a preferred staging area for BI data, SQL Server 2008 RDBMS data is
often a portion of the source data for BI solutions. As we mentioned earlier in this chapter,
data can be and often is retrieved from a variety of relational source data stores (for example,
Oracle, DB2, and so forth). To be clear, data from any source for which there is a provider can
be used as source data for an OLAP (SSAS) cube, which means data from all versions of SQL
Server along with data from other RDBMS systems.

A SQL Server 2008 installation isn’t strictly required to implement a BI solution; however,
because of the integration of some key toolsets that are part of nearly all BI solutions, such as
SQL Server Integration Services—which is usually used to perform the ETL processes for the
OLAP cubes and data mining structures—most BI solutions should include at least one SQL
Server 2008 installation. As we said earlier, although the SQL Server 2008 installation can be
on the same physical server where SSAS is installed, it is more common to use a dedicated
server.

You use the SQL Server Management Studio to administer OLTP databases, SSAS (OLAP)
cubes and data mining models, and SSIS packages. The SSMS interface showing only the
Object Explorer is shown in Figure 1-11.

Figure 1-11 SSMS in SQL Server 2008

SQL Server 2008 Integration Services


SSIS is a key component in most BI solutions. This toolset is used to import, cleanse, and
validate data prior to making the data available to SSAS for OLAP cubes or data mining struc-
tures. It is typical to use data from many disparate sources (for example, relational, flat file,
XML, and so on) as source data to a data warehouse. For this reason, a sophisticated toolset
such as SSIS facilitates the complex data loads (ETL) that are common to BI solutions. The
units of execution in SSIS are called packages. A package is an XML file that you can consider
to be a set of instructions designed using the visual tools in BIDS. We discuss plan-
ning, implementation, and many other considerations for SSIS packages in Part III.
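
Packages are typically run and debugged from BIDS during development, but a deployed package can also be executed with the dtexec command-line utility. The file path below is hypothetical:

    rem Execute a package stored in the file system (path is an example only)
    dtexec /F "C:\SSIS\LoadSalesDW.dtsx"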

You’ll use BIDS to design, develop, execute, and debug SSIS packages. The BIDS SSIS package
design environment is shown in Figure 1-12.

Figure 1-12 BIDS SSIS package designer

Optional Components of a Microsoft BI Solution


In addition to the components that are included with SQL Server 2008, a number of other
Microsoft products can be used as part of your BI solution. Most of these products allow you
to deliver the reports generated from Analysis Services OLAP cubes and data mining struc-
tures in formats customized for different audiences, such as complex reports for business
analysts and summary reports for executives.

Here is a partial list of Microsoft products that include integration with Analysis Services
OLAP cubes and data mining models:

■■ Microsoft Office Excel 2007 Many companies already own Office 2007, so using Excel
as a BI client is often attractive for its low cost and relatively low training curve. In addi-
tion to being a client for SSAS OLAP cubes through the PivotTable feature, Excel can also
be a client for data mining structures through the Data Mining Add-ins for SQL Server 2008.
(Note that connecting to an OLAP data source from Excel 2003 requires only that MS Query
be installed. MS Query is listed under optional components on the Office installation DVD.)
■■ Microsoft Word 2007 SSRS reports can be exported as Word documents that are
compatible with Office 2003 or 2007.
■■ Microsoft Visio 2007 Using the Data Mining Add-ins for SQL Server 2008, you can
create customized views of data mining structures.
■■ Microsoft Office SharePoint Server 2007 Office SharePoint Server 2007 contains
both specialized Web site templates designed to show reports (with the most common
one being the Report Center) as well as Web Parts that can be used to display indi-
vidual reports on Office SharePoint Server 2007 Web pages. A Web Part is a pluggable
user interface (UI) showing some bit of content. It is installed globally in the SharePoint
Portal Server Web site and can be added to an Office SharePoint Server 2007 portal
Web page by any user with appropriate permissions.
■■ Microsoft PerformancePoint Server PerformancePoint Server allows you to quickly
create a centralized Web site with all of your company’s performance metrics. The envi-
ronment is designed to allow business analysts to create sophisticated dashboards that
are hosted in a SharePoint environment. These dashboards can contain SSRS reports
and visualizations of data from OLAP cubes as well as other data sources. It also has a
strong set of products that support business forecasting.
PerformancePoint Server includes the functionality of the Business Scorecard Manager
and ProClarity Analytics Server. Its purpose is to facilitate the design and hosting of
enterprise-level scorecards via rich data-visualization options such as charts and reports
available in Reporting Services, Excel, and Visio. PerformancePoint Server also includes
some custom visualizers, such as the Strategy Map.

Note We are sometimes asked what happened to ProClarity, a company that had provided
a specialized client for OLAP cubes. Its target customer was the business analyst. Microsoft
acquired ProClarity in 2006 and has folded features of its products into PerformancePoint Server.

Microsoft also offers other products—such as Dynamics, Project Server, and BizTalk Server—
that use the Analysis Services storage mechanism and query engine. In addition to Microsoft
products that are designed to integrate with the BI tools available in SQL Server 2008, you
might elect to use some other developer products to improve productivity if your project’s
requirements call for .NET coding. Recall that the primary development tool is BIDS and that
BIDS does not require a Visual Studio installation. Microsoft has found that BI developers are
frequently also .NET developers, so most of them already have Visual Studio 2008. As was
mentioned, in this situation, installing SSAS, SSIS, or SSRS installs the associated developer
templates into the default Visual Studio installation.

Another consideration is the management of source code for large or distributed BI develop-
ment teams. In this situation, you can also elect to add Visual Studio Team System (VSTS) for
source control, automated testing, and architectural planning.

The data that you integrate into your BI solution might originate from relational sources.
These sources can, of course, include SQL Server 2008. They can also include nearly any type
of relational data—SQL Server (all versions), Oracle, DB2, Informix, and so forth. It is also com-
mon to include nonrelational data in BI solutions. Sources for this data can include Microsoft
Access databases, Excel spreadsheets, and so forth. It is also common to include text data
(often from mainframes). This data is sometimes made available as XML. This XML might or
might not include schema and mapping information. If complex XML processing is part of
your requirements, you can elect to use BizTalk Server to facilitate flexible mapping and load-
ing of this XML data.

You might be thinking at this point, “Wow, that’s a big list! Am I required to buy (or upgrade
to) all of those Microsoft products in order to implement a BI solution for my company?” The
answer is no. The only service that is required for an OLAP BI solution is SSAS. Also, many
companies provide tools that can be used as part of a Microsoft BI solution. Although we
occasionally refer to some third-party products in this book, we’ll focus primarily on using
Microsoft’s products and tools to build a BI solution.

Query Languages Used in BI Solutions


When working with BI solutions built on SSAS cubes and data mining structures, you use
several query languages. The primary query language for OLAP cubes is MDX. SSAS also
includes the ability to build data mining structures. To query the data in these structures, you
use DMX. XMLA is a specialized administrative scripting language used with SSAS objects
(OLAP cubes and data mining structures). Finally, RDL is the XML dialect
behind SSRS reports. In the following sections, we briefly describe each language and pro-
vide a sample.

MDX
MDX, which stands for Multidimensional Expressions, is the language used to query OLAP
cubes. Although MDX is officially an open standard and some vendors outside of Microsoft
have adopted parts of it in their BI products, the reality is that comparatively few working
.NET developers are proficient in MDX. A mitigating factor is that the amount of MDX you
must write manually in a BI solution can be relatively small; it is typically far less than the
Transact-SQL you would write for a typical OLTP database. However, retaining developers who have
at least a basic knowledge of MDX is an important consideration in planning a BI project.
We review core techniques as well as best practices for working with MDX in Chapter 10,
“Introduction to MDX,” and Chapter 11, “Advanced MDX.”

The MDX query language is used to retrieve data from SSAS cubes. A simple MDX query
is shown in Figure 1-13. Although MDX has an SQL-like structure, it is far more difficult to
master because of the complexity of the SSAS source data structures—which are multidimen-
sional OLAP cubes.

Figure 1-13 A sample MDX query
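
To give you a feel for the syntax before we get to Chapters 10 and 11, here is a small query of the same general shape, written against the Adventure Works sample cube that we work with in Chapter 2. The cube, measure, and hierarchy names come from that sample and are meant purely as an illustration:

    -- Internet sales broken out by product category (illustrative only)
    SELECT
        [Measures].[Internet Sales Amount] ON COLUMNS,
        [Product].[Product Categories].[Category].Members ON ROWS
    FROM [Adventure Works]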



DMX
Data Mining Extensions (DMX) is used to query Analysis Services data mining structures. (We
devote several future chapters to design and implementation of SSAS data mining structures.)
Although this language is based loosely on Transact-SQL, it contains many elements that are
unique to the world of data mining. As with MDX, very few working .NET developers are pro-
ficient in DMX. However, the need for DMX in BI solutions is relatively small because the SSAS
data mining structure in BIDS provides tools and wizards that automatically generate DMX
when you create those structures. Depending on the scope of your solution, retaining devel-
opers who have at least a basic knowledge of DMX might be an important consideration in
planning a BI project that includes a large amount of data mining. A simple DMX query is
shown in Figure 1-14.

Figure 1-14 Sample DMX query
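
For comparison, a singleton DMX prediction query has roughly the following shape. The mining model name and input columns here are hypothetical stand-ins for columns you would define in your own model:

    // Predict whether a single hypothetical customer is likely to buy a bike
    SELECT
        Predict([Bike Buyer]) AS [Predicted Bike Buyer]
    FROM [TM Decision Tree]
    NATURAL PREDICTION JOIN
        (SELECT 35 AS [Age], 'Bachelors' AS [Education]) AS t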

XMLA
XML for Analysis (XMLA) is used to perform administrative tasks in Analysis Services. It is an
XML dialect. Examples of XMLA tasks include viewing metadata, copying, backing up data-
bases, and so on. As with MDX and DMX, this language is officially an open standard, and
some vendors outside of Microsoft have chosen to adopt parts of it in their BI products.
Again, the reality is that very few developers are proficient in XMLA. However, you will sel-
dom author any XMLA from scratch; rather, you’ll use the tools and wizards inside SQL Server
2008 to generate this metadata. In SSMS, when connected to SSAS, you can right-click on
any SSAS object and generate XMLA scripts using the graphical user interface (GUI). XMLA is
used to define SSAS OLAP cubes and data mining structures.
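
As a simple illustration, an XMLA command to back up an SSAS database has the following shape. The database ID and file name are placeholders; scripting the same task from SSMS produces equivalent markup:

    <Backup xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <Object>
        <!-- DatabaseID must match the ID of your deployed SSAS database -->
        <DatabaseID>Adventure Works DW 2008</DatabaseID>
      </Object>
      <File>AdventureWorksDW2008.abf</File>
    </Backup>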

RDL
RDL, or the Report Definition Language, is another XML dialect that is used to create
Reporting Services reports. As with the other BI languages, RDL is officially an open standard,
and some vendors outside of Microsoft have chosen to adopt parts of it in their BI prod-
ucts. You rarely need to manually write RDL in a BI solution because it is generated for you
automatically when you design a report using the visual tools in BIDS. We’ll review core tech-
niques as well as best practices for working with RDL in future chapters.

Summary
In this chapter, we covered basic data warehousing terms and concepts, including BI, OLTP,
OLAP, dimensions, and facts (or measures). We defined each term so that you can bet-
ter understand the possibilities you should consider when planning a BI solution for your
company.

We then introduced the BI tools included with SQL Server 2008. These include SSAS, SSIS,
SSRS, and Data Mining Add-ins. For each of these BI tools, we defined what parts of a BI
solution’s functionality that particular tool could provide. Next, we discussed other Microsoft
products that are designed to be integrated with BI solutions built using SSAS OLAP cubes
or data mining structures. These included Excel, Word, and Office SharePoint Server 2007.
We also touched on the integration of SSAS into PerformancePoint Server. We concluded our
conceptual discussion with a list and description of the languages involved in BI projects.

In Chapter 2, we work with the sample database AdventureWorksDW, which is included in
SQL Server 2008, so that you get a quick prototype SSAS OLAP cube and data mining struc-
ture up and running. This is a great way to begin turning the conceptual knowledge you’ve
gained from reading this chapter into practical understanding.
Chapter 2
Visualizing Business Intelligence Results
As you learn more about the business intelligence (BI) tools available in Microsoft SQL Server
2008 and other Microsoft products that integrate BI capabilities (such as Microsoft Office
Excel 2007), you can begin to match features available in these products to current chal-
lenges your organization faces in its BI implementation and in future BI solutions you’re
planning. In this chapter, we summarize the most common business challenges and typical BI
solutions using components available in SQL Server 2008 or BI-enabled Microsoft products.

We also preview the most common way we help clients visualize the results of their BI
projects. If you’re new to business intelligence and are still digesting the components and
concepts covered in Chapter 1, “Business Intelligence Basics,” you should definitely read this
chapter. If you have a good foundation of business intelligence knowledge but are wonder-
ing how to best translate your technical and conceptual knowledge into language that busi-
ness decision makers can understand so that they will support your project vision, you should
also find value in this chapter.

Matching Business Cases to BI Solutions


Now that you’ve begun to think about the broad types of business challenges that BI
solutions can address, the next step is to understand specific business case
scenarios that can benefit from SQL Server 2008 BI solutions. A good starting place is
http://microsoft.com/casestudies. There you can find numerous Microsoft case studies—from
industries as varied as health care, finance, education, manufacturing, and so on. As you are
furthering your conceptual understanding of Microsoft’s BI tools, we suggest that you search
for and carefully review any case studies that are available for your particular field or business
model. Although this approach might seem nontechnical, even fluffy, for some BI developers
and architects, we find it to be a beneficial step in our own real-world projects.

We’ve also found that business decision makers are able to better understand the possibilities
of a BI solution based on SQL Server 2008 after they have reviewed vertically aligned case
studies. Finding a specific reference case that aligns with your business goals can save you
and your developers a lot of time and effort.

For example, we’ve architected some BI solutions with SQL Server 2005 and 2008 in the
health care industry and we’ve used the case study for Clalit Health Care HMO (Israel) several


times with health care clients. This case study is found at http://www.microsoft.com/industry/
healthcare/casestudylibrary.mspx?casestudyid=4000002453.

A good place to start when digesting case studies is with the largest case study that
Microsoft has published. It’s called Project Real and is based on the implementation of the
complete SQL Server 2005 BI toolset at Barnes and Noble Bookstores. Microsoft is currently
in the process of documenting the upgrade to SQL Server 2008 BI in the public case study
documentation. This case study is a typical retail BI study in that it includes standard retail
metrics, such as inventory management and turnover, as well as standard profit and loss met-
rics. It also covers commonly used retail calendars, such as standard, fiscal, and 4-5-4. You
can find the case study and more information about it at http://www.microsoft.com/technet/
prodtechnol/sql/2005/projreal.mspx.

This case study version is also interesting because Microsoft has worked with Barnes and
Noble on this enterprise-level BI project through three releases of SQL Server (2000, 2005,
and 2008). There is a case study for each version. It’s also interesting because the scale of
the project is huge—in the multiterabyte range. Microsoft has also written a number of
drill-down white papers based on this particular case study. We’ve used information from
these white papers to help us plan the physical and logical storage for customers that we’ve
worked with who had cubes in the 50-GB or greater range.

Also, several teams from Microsoft Consulting Services have developed a BI software life
cycle using the Barnes and Noble Bookstores example. We have used the life cycle approach
described in the Project Real case study in several of our own projects and have found it
to be quite useful. We’ve found it so useful, in fact, that we’ll detail in Chapter 3, “Building
Effective Business Intelligence Processes,” exactly how we’ve implemented this life cycle
approach throughout the phases of a BI project.

Another good case study to review is the one describing the use of the BI features of SQL
Server 2008 within Microsoft internally. This case study can be found at
http://www.microsoft.com/casestudies/casestudy.aspx?casestudyid=4000001180. In addition to providing a
proof-of-concept model for a very large data warehouse (which is estimated to grow to 10
TB in the first year of implementation), this case study also shows how new features—such as
backup compression, change data capture, and more—were important for an implementa-
tion of this size to be effective.

Your goal in reviewing these reference implementations is to begin to think about the scope
of the first version of your BI project implementation. By scope, we mean not only the types
of data and the total amount of data you’ll include in your project, but also which services
and which features of those services you’ll use.

We’ll also briefly describe some real-world projects we’ve been involved with so that you
can get a sense of the ways to apply BI. Our experience ranges from working with very small
organizations to working with medium-sized organizations. We’ve noticed that the cost
to get started with BI using SQL Server 2008 can be relatively modest (using the Standard
edition of SQL Server 2008) for smaller organizations jumping into the BI pool. We’ve also
worked with public (that is, government) and private sector businesses, as well as with non-
profit organizations. In the nonprofit arena, we’ve built cubes for organizations that want to
gain insight into their contributor bases and the effectiveness of their marketing campaigns.

A trend we’ve seen emerging over the past couple of years is the need to use BI for busi-
nesses that operate in multiple countries, and we’ve worked with one project where we had
to localize both metadata (cube, dimension, and so on) names and, importantly, measures
(the most complicated of which was currency). Of course, depending on the project, localiza-
tion of measures can also include converting U.S. standard measures to metric. We worked
with a manufacturing company whose initial implementation involved U.S. measurement
standards only; however, the company asked that our plan include the capability to localize
measures for cubes in its next version.

Another interesting area in which we’ve done work is in the law enforcement field. Although
we’re not at liberty to name specific clients, we’ve done work with local police to create more
efficient, usable offender information for officers in the field. We’ve also done work at the
state and federal government levels, again assisting clients to develop more efficient access
to information about criminal offenders or potential offenders. In somewhat similar projects,
we worked with social services organizations to assist them in identifying ineffective or even
abusive foster care providers.

Sales and marketing is, of course, a mainstay of BI. We’ve done quite a bit of work with
clients who were interested in customer profiling for the purpose of increasing revenues.
Interestingly, because of our broad experience in BI, we’ve sometimes helped our bricks-and-
mortar clients to understand that BI can also be used to improve operational efficiencies,
which can include operational elements such as time-to-process, time-to-ship, loss preven-
tion information, and much more.

Our point in sharing our range of experience with you is to make clear to you that BI truly is
for every type of business. Because BI is so powerful and so broad, explaining it in a way that
stakeholders can understand is critical. Another tool that we’ve evolved over time is what we
call our Top 10 BI Scoping Questions. Before we move into the visualization preview, we’ll list
these questions to get you thinking about which questions you’ll want to have answered at
the beginning of your project.

Top 10 BI Scoping Questions


Without any further fanfare, here is the Top 10 BI Scoping Questions list:

1. What are our current pain points with regard to reporting? That is, is the process slow,
is data missing, is the process too rigid, and so on?
2. Which data sources are we currently unable to get information from?
3. Who in our organization needs access to which data?
4. What type of growth—that is, merger, new business, and so on—will affect our report-
ing needs over the next 12 months?
5. What client tools are currently being used for reporting? How effective are these tools?
6. How could our forecasting process be improved? For example, do we need more infor-
mation, more flexibility, or more people having access?
7. What information do we have that doesn’t seem to be used at all?
8. Which end-user groups currently have no access, or only limited access, to key datasets?
9. How satisfied are we with our ability to execute “What if” and other types of forecasting
scenarios?
10. Are we using our data proactively?

We’re aware that we haven’t provided you with enough information yet to design and build
solutions to answer these questions. The foundation of all great BI solutions is formed by
asking the right questions at the beginning of the project so that you can build the solution
needed by your organization. Because we’re laying much of the groundwork for your proj-
ect in this chapter, we’ll mention here the question you’ll invariably face early in your project
cycle:

“If BI is so great, why isn’t everybody using it?”

We specifically addressed this question in the introduction to this book. If you’re at a loss
to answer it, you might want to review our more complete answer there. Here’s the short
answer:

“BI was too difficult, complicated, and expensive to be practical for all but the largest businesses
prior to Microsoft’s offerings, starting with SQL Server 2005.”

We have one last section to present before we take a closer look at BI visualization. It intro-
duces some of the pieces we’ll be exploring in this book. It can be quite tricky to correctly
position (or align) BI capabilities into an organization’s existing IT infrastructure at the begin-
ning of a project, particularly when Microsoft’s BI is new to a company. Although it’s true BI is
a dream to use after it has been set up correctly, it’s also true that there are many parts that
you, the developer, must understand and set up correctly to get the desired result. This can
result in the classic dilemma of overpromising and underdelivering.

Although we think it’s critical for you to present BI to your business decision makers (BDMs)
in terms of end-user visualizations so that those BDMs can understand what they are getting,
it’s equally important for you to understand just what you’re getting yourself into! To that
end, we’ve included the next section to briefly show you the pieces of a BI solution.

Components of BI Solutions
Figure 2-1 shows that a BI solution built on all possible components available in SQL Server
2008 can have many pieces.

We’ll spend the rest of this book examining the architectural components shown in Figure
2-1. In addition to introducing each major component and discussing how the compo-
nents relate to each other, in Part I, “Business Intelligence for Business Decision Makers and
Architects,” we discuss the design goals of the BI platform and how its architecture realizes
these goals. We do this in the context of the major component parts—that is, SQL Server
Analysis Services (SSAS), SQL Server Integration Services (SSIS), and SQL Server Reporting
Services (SSRS). Each component has an entire section of the book dedicated to its imple-
mentation details.

To combat the complexity of implementing solutions built using BI concepts (OLAP cubes
or data mining models), Microsoft has placed great emphasis on usability for BI developers
and administrators in SQL Server 2008. Although SSAS, SSIS, and SSRS already included a
large number of tools and wizards to facilitate intelligent and quick implementation of each
in SQL Server 2005, nearly all of these tools and wizards have been reviewed, reworked, and
just generally improved in SQL Server 2008. In addition, there are new tools and utilities
to improve usability for developers. One important example is the Analysis Management
Objects (AMO) warnings. This feature (shown in Figure 2-2) displays warnings in the Business
Intelligence Development Studio (BIDS) Error List window when developers implement OLAP
cube designs that are contrary to best practices. The tool is configurable so that you can turn
off warning types that you want to ignore.

As we continue on our journey of understanding BI, we’ll not yet get into an architectural dis-
cussion of the BI suite; rather, we’ll use an approach that we’ve had quite a bit of success with
in the real world. That is, we’ll examine the results of a BI solution through the eyes of a typi-
cal end user. You might be surprised by this approach, thinking, “Hey, wait. I’m a developer!”
Bear with us on this one; seeing BI from an end-user perspective is the most effective way for
anyone involved in a BI project to learn it.

Figure 2-1 BI component architecture from SQL Server Books Online



Figure 2-2 AMO warnings for an OLAP cube in BIDS

What we mean by this is that whether you’re a developer who is new to BI, or even one who
has had some experience with BI, we find over and over that the more you understand about
the end-user perspective, the better you can propose and scope your envisioned BI project,
and ultimately gain appropriate levels of management support for it. Also, like it or not, part
of your work in a BI project will be to teach BDMs, stakeholders, and other end users just
what BI can do for them. We have found this need to be universal to every BI project we have
implemented to date.

Because of the importance of understanding current visualization capabilities in Microsoft's
BI products, we're going to devote the remainder of this chapter to this topic. If you're look-
ing for a more detailed explanation of exactly how you’ll implement these (and other) client
solutions, hold tight. We devote an entire section of this book to that topic. First you have to
see the BI world from the end-user’s perspective so that you can serve as an effective “trans-
lator” at the beginning of your project.

Tip For each BI component you want to include in your particular BI project, select or design
at least one method for envisioning the resulting information for the particular type of end-user
group involved—that is, analysts, executives, and so on. Components include OLAP cubes, data
mining structures, and more.

Understanding Business Intelligence from a User's Perspective

As we mentioned already, visualizing the results of BI solutions is one of the more difficult
aspects of designing and building BI solutions. One of the reasons for this is that we really
don’t have the ability to “see” multidimensional structures (such as OLAP cubes) that we
build with SSAS. Figure 2-3 is somewhat helpful because it shows data arranged in a cube.
However, understand that this visualization reflects only a small subset of what is possible.

The figure shows an OLAP cube with only three dimensions (or aspects) and only one mea-
sure (or fact). That is, this cube provides information about the following question: “How
many packages were shipped via what route, to which location, at what time?” Real-world
cubes are far more complex than this. They often contain dozens or hundreds of dimensions
and facts. This is why OLAP cubes are called n-dimensional structures. Also, OLAP cubes are
not really cube-shaped because they include the efficiency of allocating no storage space for
null intersection points. That is, if there is no value at a certain intersection of dimensional
values—for example, no packages were shipped to Africa via air on June 3, 1999—no storage
space is needed for that intersection. This, in effect, condenses cubes to more of a blob-like
structure. Another way to think about it is to understand that if data from a relational source
is stored in the native cube storage format (called MOLAP), it is condensed to about one-
third of the original space. We’ll talk more about storage in Chapter 9, “Processing Cubes and
Dimensions.”

So how do you visualize an n-dimensional structure? Currently, most viewers implement
some form of a pivot-table-like interface. We'll show you a sample of what you get in Excel
shortly. Note that although a pivot-table-like interface might suffice for your project, both
Microsoft and independent software vendors (ISVs) are putting a lot of effort into improving
BI visualization. Although there are some enhanced visualization components already avail-
able in the market, we do expect major improvements in this critical area over the next two
to three years. An example of a commercial component is CubeSlice. As with most, but not
all, commercial components, CubeSlice can be integrated into Excel.

If your BI project involves huge, complex cubes, it will be particularly important for you to
continue to monitor the market to take advantage of newly released solutions for BI visu-
alization. One area we pay particular attention to is the work that is publicly available from
Microsoft Research (MSR). MSR includes one dedicated group—Visualization and Interaction
for Business and Entertainment (VIBE)—whose purpose is to invent or improve data visualiza-
tion tools in general. This group sometimes releases its viewers for public consumption on
its Web site at http://research.microsoft.com/vibe. Another source from which we get insight
into the future of BI data visualization is the annual TED (Technology, Entertainment, Design)
conference. One particular talk we recommend is that of Hans Rosling, which shows global
health information in an interesting viewer. You can watch his speech at http://www.ted.com/
index.php/talks/hans_rosling_reveals_new_insights_on_poverty.html.

Figure 2-3 Conceptual picture of an OLAP cube (from SQL Server Books Online)

As with OLAP cubes, data mining structures are difficult to visualize. It’s so difficult to visual-
ize these structures that SQL Server Books Online provides no conceptual picture. Data min-
ing structures also have a “shape” that is not relational. We’ll take a closer look at data mining
mechanics (including storage) in Chapter 12, “Understanding Data Mining Structures,” and
Chapter 13, “Implementing Data Mining Structures.”

As we mentioned, BIDS itself provides viewers for both OLAP cubes and data mining struc-
tures. You might wonder why these viewers are provided, when after all, only you, the devel-
oper, will look at them. There are a couple of reasons for this.

The first reason is to help you, the developer, reduce the visualization problems associated
with building both types of structures. Figure 2-4 provides an example of one of these visu-
alizers, which shows the results of viewing a sample data mining model (part of a structure)
using one the built-in tools in BIDS. What you see in the figure is the result of a data mining
algorithm that shows you time series forecasting—that is, what quantity of something (in this
case, a bicycle) is predicted to be sold over a period of time. The result comes from applying
the time series algorithm to the existing source data and then putting out a prediction for x
number of future time periods.

Figure 2-4 A visualizer in BIDS for a data mining time series algorithm

The second reason BIDS provides these visualizers is because you, the developer, can directly
use them in some client implementations. As you’ll see shortly, the viewers included in BIDS
and SSMS are nearly identical to those included in Excel 2007. These viewers are also avail-
able as embeddable controls for custom Windows Forms applications.

Another way to deepen your understanding of BI in SQL Server 2008 is for you to load the
sample applications and to review all the built-in visualizers for both OLAP cubes and data
mining structures that are included with BIDS. We’re going to take this concept a bit further
in the next section, showing you what to look at in the samples in the most commonly used
end-user interface for SSAS, Excel 2007.

Demonstrating the Power of BI Using Excel 2007


As you learn about the power and potential of BI, a logical early step is to install the samples
and then take a look at what is shown in BIDS. Following that, we recommend that you
connect to both OLAP cubes and data mining models using Excel 2007. The reason you
should view things in this order is to get some sense of what client tools can or should look
like for your particular project.

We’re not saying that all Microsoft BI solutions must use Excel as a client. In our experience,
most solutions will use Excel. However, that might not be appropriate for your particular busi-
ness needs. We always find using Excel to support quick prototyping and scope checking to
be valuable. Even if you elect to write your own client application—that is, by using Windows
Forms or Web Forms—you should familiarize yourself with the Excel viewers, because most
of these controls can be used as embeddable controls in your custom applications.

We’ll focus on getting this sample set up as quickly as possible in the next section. To that
end, we will not be providing a detailed explanation of why you’re performing particular
steps—that will come later in the book.

Next, we’ll give you the steps to get the included samples up and running. At this point, we’re
going to focus simply on clicks—that is, we’ll limit the explanation to “Click here to do this”
and similar phrasing. The remaining chapters explain in detail what all this clicking actually
does and why you click where you’re clicking.

Building the First Sample—Using AdventureWorksDW2008


To use the SQL Server 2008 AdventureWorksDW2008 sample database as the basis for
building an SSAS OLAP cube, you need to have at least one machine with SQL Server 2008
and SSAS installed on it. While installing these applications, make note of the edition of
SQL Server you’re using (you can use the Developer, Standard, or Enterprise edition)
because you’ll need to know the particular edition when you install the sample cube
files. All screens and directions in this chapter apply to the Enterprise (or Developer) edition
of SQL Server 2008. The Enterprise and Developer editions have equivalent features. As we
dig deeper into each of the component parts of data warehousing—that is, SSAS, SSIS, and
SSRS—we’ll discuss feature differences by edition (that is, Standard, Enterprise, and so on).

Tip In the real world, we’ve often set up a test configuration using Virtual PC. A key reason we
chose to use Virtual PC is for its handy Undo feature. We find being able to demonstrate a test
configuration for a client and then roll back to a clean install state by simply closing Virtual PC
and selecting Do Not Save Changes is very useful and saves time. Virtual PC is a free download
from this URL: http://www.microsoft.com/downloads/details.aspx?familyid=04D26402-3199-48A3-AFA2-2DC0B40A73B6&displaylang=en.
The Do Not Save Changes feature requires that you
install the virtual machine additions, which are part of the free download but are not installed by
default.

If you’re installing SQL Server, remember that the sample databases are not installed by
default. A quick way to get the latest samples is to download (and install) them from the
CodePlex Web site. Be sure to download the samples for the correct version and edition of
SQL Server.

What is CodePlex? Located at http://www.codeplex.com, the CodePlex Web site is a code
repository for many types of projects. The SQL Server samples are one of thousands of open-
source projects that Microsoft hosts via this site. The site itself uses a Team Foundation Server
to store and manage project source code. The site includes a Web-based user interface. Use
of CodePlex is free, for both downloading and posting open-source code.

Note that you’ll need two .msi files for building the sample SSAS project, which can be down-
loaded from http://www.codeplex.com/MSFTDBProdSamples/Release/ProjectReleases.aspx?ReleaseId=16040.
You download two files for whatever hardware you’re working with—that
is, x86, IA64, or x64 installations. The names of the files are SQL2008.AdventureWorks DW
BI v2008.xNN.msi and SQL2008.AdventureWorks All DB Scripts.xNN.msi. (Replace the letters
NN with the type of hardware you’re using—that is, replace it with 86 if you’re using x86,
with 64 if you’re using x64, and so on.) After you install these two files, you’ll get a folder
that contains several files named AdventureWorks Analysis Services Project. The sample
files for SSAS are located at C:\Program Files\Microsoft SQL Server\100\Tools\Samples\
AdventureWorks 2008 Analysis Services Project\Enterprise.

Note There are two versions of the sample files. These are unpacked into two separate folders:
Enterprise and Standard. We’ll work with the Enterprise folder in this book.

The file you’ll use is the Visual Studio solution file called Adventure Works.sln. Before you
double-click it to open BIDS, you must perform one more setup step.

The sample database you’ll use for building the sample project is AdventureWorksDW2008.
You’ll use AdventureWorksDW2008, rather than AdventureWorks2008, as the source data-
base for your first SSAS OLAP cube because it’s modeled in a way that’s most conducive
to creating cubes easily. In Chapter 5, “Logical OLAP Design Concepts for Architects,” and
Chapter 6, “Understanding SSAS in SSMS and SQL Server Profiler,” we’ll discuss in detail what
best-practice modeling for SSAS cubes consists of and how you can apply these modeling
techniques to your own data.

To set up AdventureWorksDW2008, double-click on this .msi file: SQL2008.AdventureWorks
All DB Scripts.xNN.msi. This unpacks the scripts for both the OLTP sample database called
AdventureWorks and the dimensionally modeled sample database called Adventure-
WorksDW to your computer. After this has succeeded, run BuildAdventureWorks.cmd
(located in C:\Program Files\Microsoft SQL Server\100\Tools\Samples) from the command
prompt to restore the databases. This will run a series of scripts through SSMS.

Note You can optionally choose to install another sample database, called AdventureWorksLT,
for other testing purposes. This requires a separate download from CodePlex. AdventureWorksLT
is a lightweight (or simple) OLTP database and is generally not used for data warehouse testing
and modeling.

Verify that AdventureWorksDW2008 has installed correctly by opening SQL Server Man-
agement Studio and connecting to the SQL Server instance where you’ve installed it. You’ll
see this database listed in the tree view list of objects, called Object Explorer, as shown in
Figure 2-5. You might notice that the table names have a particular naming convention—
most names start with the prefixes “Dim” or ”Fact.” If you guessed that these names have to
do with dimensions and facts, you’d be correct! We’ll explore details of OLAP data source
modeling such as these later in the book.

Figure 2-5 AdventureWorksDW2008 in SSMS
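
If you prefer a query window to Object Explorer, a quick Transact-SQL check such as the following (a sketch, assuming the default sample database names) confirms that the sample databases were restored:

    -- List the restored AdventureWorks sample databases on this instance
    SELECT name
    FROM sys.databases
    WHERE name LIKE 'AdventureWorks%';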

Opening the Sample in BIDS


Now you’re ready to open the sample in BIDS. As mentioned previously, at this point in our
discussion we are not looking to explain the BIDS interface. An entire section of this book is
devoted to that. For now, we want to simply open the sample, process (or build) the OLAP
cube and data mining structures, and then connect to what we’ve built using the included
tools in Excel 2007.

The steps to do this are remarkably quick and easy. Simply navigate to the sample folder
location listed earlier in this section, and then double-click on the Visual Studio solution
file called Adventure Works.sln. This opens BIDS and loads the OLAP cube and data mining
structure metadata into the Solution Explorer window in BIDS. You’ll see a number of files
in this window. These files show the objects included in the sample. The most important of
these files are two sample OLAP cubes and five sample data mining structures, which are
shown in Figure 2-6. The screen shot shows the sample named Adventure Works DW 2008,
which was the name of the sample for all beta or community technology preview (CTP) ver-
sions of SQL Server 2008. The final, or release-to-manufacturing (RTM), sample name is
Adventure Works 2008. We’ve updated screen shots to reflect the final sample in this book.
However, where there were no changes between the CTP and RTM samples, we used the
original screen shots, so you can disregard the project name discrepancies.

Figure 2-6 Cubes and data mining structures in BIDS



Note You might be wondering whether BIDS is the same thing as Visual Studio and, if so, does
that mean the data warehouse development requires an installation of full-blown Visual Studio
on each developer’s machine? The answer is no—BIDS is a subset of Visual Studio. If Visual Studio
is already installed, BIDS installs itself as a series of templates into the Visual Studio instance. If
Visual Studio is not installed, the mini version of Visual Studio that includes only BIDS functional-
ity is installed onto the developer’s computer.

After you’ve opened the project in BIDS, the only other step you need to perform to make
these samples available for client applications is to deploy these objects to the SSAS server.
Although you can do this in BIDS by simply right-clicking the solution name (Adventure
Works DW) in the Solution Explorer window and then choosing Deploy from the menu, there
is quite a bit of complexity behind this command.

If you expand the Deployment Progress window in BIDS, you’ll get a sense of what is hap-
pening when you deploy a BIDS project. This window shows detailed information about
each step in the build and deploy process for an OLAP cube. Roughly, metadata for the
objects is validated against a schema, with dimensions being processed first. Next, cubes
and mining models are processed, and then data from the source system (in this case, the
AdventureWorksDW database) is loaded into the structures that have been created on SSAS.
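
If you're curious about what the Deploy command actually sends to the server, it is a series of XMLA commands. Processing the deployed database is conceptually similar to the following command (the database ID is an example and must match your own deployment):

    <Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <Object>
        <DatabaseID>Adventure Works DW 2008</DatabaseID>
      </Object>
      <Type>ProcessFull</Type>
    </Process>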

Tip If you get a “deployment failed” warning in the Deployment Progress window, check your
connection strings (known as data sources in BIDS). The sample connects to an instance of SQL
Server at localhost with credentials as configured by the default string. You might need to update
these connection string values to reflect your testing installation of SSAS. You might also need to
update the project settings to point to your instance of SSAS by right-clicking on the project in
Solution Explorer and choosing Properties. In the resulting dialog box, choose the Deployment
option and verify that the Server option points to the correct instance of SSAS.

Now you’re ready to validate the sample OLAP cube and data mining structures using the
built-in browsers in BIDS. The browser for the OLAP cube looks much like a pivot table. As we
mentioned, this browser is included so that you, as a cube developer, can review your work
prior to allowing end users to connect to the cube using client BI tools. Most client tools con-
tain some type of pivot table component, so the included browsers in BIDS are useful tools
for you. To be able to view the sample cube using the built-in cube browser in BIDS, double-
click on the Adventure Works cube object in the Cubes folder of the Solution Explorer win-
dow of BIDS.

To take a look at the cube using the browser in BIDS, click on the Browser tab on the cube
designer surface of BIDS. This opens the OLAP cube browser. Depending on the hardware
capabilities of your test machine, this step might take a couple of seconds to complete. The
designer surface looks much like a pivot table. To start, select the measures and dimensions
you’d like displayed on the designer surface. You do this by selecting them and dragging
them to the appropriate position on the designer surface. The interface guides you, issuing a
warning if you drag a type of item to an area that cannot display it.

Just to get you started, we’ve included Figure 2-7, which shows you the results of dragging
the Internet Average Sales Amount measure from the Internet Sales measure group (folder)
to the center of the pivot table design area. Next we dragged the Customer Geography hier-
archy from the Customer dimension to the rows axis (left side) position. Finally, we dragged
the Product Categories hierarchy from the Product dimension to the columns axis (top side)
position. The results are shown in Figure 2-7.

Figure 2-7 BIDS OLAP cube browser

Tip To remove an item from the browser, simply click on the column or row header of the item
and then drag it back over the tree listing of available objects. You’ll see the cursor change to an
x to indicate that this value will be deleted from the browser.

Now that you’ve got this set up, you might want to explore a bit more, dragging and drop-
ping values from the tree listing on the left side of the cube browser design area to see
what’s possible. We encourage you to do this. What is happening when you drag and drop
each object from the tree view on the left side of BIDS to the design area is that an MDX
query is being generated by your activity and is automatically executed against the SSAS
OLAP cube.

Note also that you can drag row headers to the columns area and vice versa—this is called
pivoting the cube. Also, spend some time examining the Filter Expression section at the top of
the browser. In case you were wondering, when it comes time for you to look at and under-
stand the MDX queries being generated, there are many ways to view those queries. At this
point in our explanation, however, we are not yet ready to look at the query syntax. Note
that many meaningful report-type result sets can be generated by simply clicking and drag-
ging, rather than by you or another developer manually writing the query syntax for each
query. And this is exactly the point we want to make at this juncture of teaching you about
OLAP cubes.
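
To make that concrete, the layout shown in Figure 2-7 corresponds roughly to a query like the one below. The MDX the browser actually generates is more verbose, but the shape is the same; the object names come from the Adventure Works sample cube:

    -- Average Internet sales amount by customer country and product category
    SELECT
        [Product].[Product Categories].[Category].Members ON COLUMNS,
        [Customer].[Customer Geography].[Country].Members ON ROWS
    FROM [Adventure Works]
    WHERE ([Measures].[Internet Average Sales Amount])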

Note In case you were wondering whether you can view sample data mining structures in BIDS,
the answer is yes. The AdventureWorks samples include data mining structures. Each structure
contains one or more data mining models. Each mining model has one or more viewers available
in BIDS. Later in this chapter, we’ll talk a bit more about how to see mining model results using
the data mining viewers. For now, however, we’ll continue our exploration of OLAP cubes, mov-
ing on to the topic of using Excel 2007 as a client tool.

Connecting to the Sample Cube Using Excel 2007


Now that you’ve set up and deployed the sample cubes, it’s important for you to be able to
see things from an end user’s perspective. An easy way to do this is by using a PivotTable
view in Excel 2007. Open Excel 2007 and set up a connection to an SSAS OLAP cube using
the Data tab of the Ribbon. In the Get External Data group, click From Other Sources (shown
in Figure 2-8), and then click the associated From Analysis Services button.

Figure 2-8 The Get External Data group on the Data tab of the Excel 2007 Ribbon

After you click the From Analysis Services button, a multistep wizard opens. Enter the name
of your SSAS instance, your connection credentials, and the location on the worksheet where
you want to place the resulting PivotTable and the data from the OLAP cube. After you click
Finish, the wizard closes and the newly designed PivotTable view opens on the Excel work-
book page.
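
Behind the scenes, the wizard builds an Office data connection (.odc) file whose connection string looks something like the following. The server and database names are examples; yours will reflect your SSAS instance and the name of your deployed sample database:

    Provider=MSOLAP;Integrated Security=SSPI;Data Source=localhost;Initial Catalog=Adventure Works DW 2008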

Note The PivotTable view has been redesigned in Excel 2007. This redesign was based on
Microsoft’s observations of users of PivotTable views in previous versions of Excel. The focus in
Excel 2007 is on ease of use and automatic discoverability. This design improvement is particular-
ly compelling if you’re considering using an Excel PivotTable view as one of the client interfaces
for your BI solution, because usability is a key part of end-user acceptance criteria.

Because you’ve already spent some time exploring the sample OLAP cubes in the built-in
cube browser included in BIDS, you’ll probably find a great deal of similarity with the items in
the PivotTable Field List in Excel. This similarity is by design. To get started, set up the same
cube view that you saw earlier in the BIDS browser.

To do this, filter the available fields to show only fields related to Internet Sales by select-
ing that value in the drop-down list in the PivotTable Field List area. Next select the fields of
interest—in this case, the Internet Sales Amount item, Product Categories item (expand the
Product group to see this value), and Sales Territory item (expand the Sales Territory group).
Note that when you click in the check boxes to make your selections, the measure field
(Internet Sales Amount) is automatically added to the Values section of the PivotTable view,
the other two items are automatically added to the Row Labels section, and both are placed
on the rows axis of the PivotTable view. This is shown in Figure 2-9.

Figure 2-9 PivotTable Field List in Excel 2007

To pivot the view of the data, you have several options. Probably the easiest way to accom-
plish this is to drag whichever fields you want to pivot between the different areas in the
bottom section of the PivotTable Field List. In this case, if you wanted to alter the layout we
created earlier using the BIDS cube browser, you could simply drag the Product Categories
button from the Row Labels section to the Column Labels section and then that change
would be reflected on the PivotTable designer surface.

You might also want to create a PivotChart view. Some people simply prefer to get infor-
mation via graphs or charts rather than by rows and columns of numbers. As you begin to
design your BI solution, you must consider the needs of all the different types of users of
your solution. To create a PivotChart view, simply click anywhere on the PivotTable view. Then
click on the Options tab of the Ribbon, and finally, click on the PivotChart button. Select a
chart type in the resulting Insert Chart window, and click OK to insert it. A possible result is
shown in Figure 2-10.

Figure 2-10 A PivotChart in Excel 2007

As we mentioned earlier (when discussing using the OLAP sample cube with the built-in
viewer in the BIDS developer interface), we encourage you to play around with PivotTable
and PivotChart capabilities to view and manipulate SSAS OLAP cube data using Excel 2007.
The better you, as a BI solution architect and developer, understand and are able to visualize
what is possible with OLAP cubes (and relate that to BDMs and end users), the more effec-
tively you’ll be able to interpret business requirements and translate those requirements into
an effective BI solution. The next section of the chapter covers another important area: visu-
alizing the results of data mining structures.

Understanding Data Mining via the Excel Add-ins


The next part of your investigation of the included samples is to familiarize yourself with
the experience that end users have with the data mining structures that ship as part of the
AdventureWorksDW2008 sample. In our real-world experience, we find that less than 5 per-
cent of our BI clients understand the potential of SSAS data mining. We believe that the path
to providing the best solutions to clients is for us to first educate you, the developer, about
the possibilities and then to arm you with the techniques to teach your BI clients about what
is possible.

To that end, here’s some basic information to get you started. Each data mining structure
includes one or more data mining models. Each data mining model is based on a particular
data mining algorithm. Each algorithm performs a particular type of activity on the source
data, such as grouping, clustering, predicting, or a combination of these. We’ll cover data
mining implementation for you in greater detail in Chapters 12 and 13.
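
As a sketch of that relationship, the following DMX statement adds one more mining model, built with a specific algorithm, to an existing mining structure. The structure, model, and column names are illustrative:

    // Add a clustering model to a hypothetical mining structure
    ALTER MINING STRUCTURE [Targeted Mailing]
    ADD MINING MODEL [TM Clustering Example]
    (
        [Customer Key],
        [Age],
        [Bike Buyer] PREDICT
    )
    USING Microsoft_Clustering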

You can start by opening and reviewing the samples in BIDS. BIDS conveniently includes one
or more viewers for each data mining model. Note the following five structures:

■■ Targeted Mailing
■■ Market Basket
■■ Sequence Clustering
■■ Forecasting
■■ Customer Mining

Each model is an implementation of one of the data mining algorithms included in SSAS.
Figure 2-11 shows the data mining structures included in Solution Explorer in BIDS. Note
also that mining models are grouped into a data mining structure (based on a common data
source).

Figure 2-11 AdventureWorksDW2008 data mining structures

As you did with the OLAP cube browser in BIDS, you should spend a bit of time using the
included mining model viewers in BIDS to look at each of the models. To do this, simply click
on the data mining structure you want to explore from the list shown in Solution Explorer.
Doing so opens that particular mining structure in the BIDS designer. You then double-click
the Mining Model Viewer tab to use the included data mining model viewers. Note that you
can often further refine your view by configuring available parameters.

Also note that the type of viewer or visualizer changes depending on which algorithm was
used to build the data mining model. Realizing this helps you understand the capabilities of
the included data mining algorithms. A key difference between Microsoft’s implementation
of data mining and all other competitors is that Microsoft’s focus is on making data mining
accessible to a broader segment of developers than would traditionally use data mining. The
authors of this book can be included in this target group. None of us has received formal
training in data mining; however, we’ve all been able to successfully implement data mining
in production BI projects because of the approach Microsoft has taken in its tooling in BIDS
and Excel 2007.

As with the OLAP cube viewer, the data mining viewers in BIDS are not meant to be used by
end users; rather, they are included for developer use. Figure 2-12 shows the Dependency
Network view for the Targeted Mailing sample mining structure (looking at the first included
mining model, which was built using the Microsoft Decision Trees algorithm). We’ve adjusted
the view by dragging the slider underneath the All Links section on the left to the center
position so that you’re looking at factors more closely correlated to the targeted value, which
in this case is whether or not a customer has purchased a bicycle.

We particularly like this view because it presents a complex mathematical algorithm in an
effective visual manner—one that we’ve been able to translate to nearly every client we’ve
ever worked with, whether they were developers, analysts, BDMs, or other types of end users.

Figure 2-12 BIDS mining model Dependency Network view

Viewing Data Mining Structures Using Excel 2007


To view these sample models using Excel 2007, you must first download and install the SQL
Server 2008 Data Mining Add-ins for Office 2007 from http://www.sqlserverdatamining.com/
ssdm/Default.aspx?tabid=102&Id=374. The add-ins include three types of support for viewing
data mining models in Excel and Visio 2007. Specifically, after being installed, the add-ins
make modifications to the Ribbons in both Excel and Visio. In Excel, two new tabs are added:
Data Mining and Table Tools Analyze. In Visio, one new template is added: Data Mining. After
downloading the add-ins, install them by following the steps described in the next paragraph
to get the samples up and running.

Run SQLServer2008_DMAddin.msi to install the Data Mining Add-ins. After the add-ins have
successfully installed, navigate to the Microsoft SQL Server 2008 Data Mining Add-Ins menu
and click the Server Configuration Utility item to start the wizard. Here you’ll configure the
connection to the Analysis Services instance by listing the name of your SSAS instance (probably
localhost) and then specifying whether you’ll allow the creation of temporary mining
models. Then specify the name of the database that will hold metadata. When you enable
the creation of temporary (or session) mining models, authorized users can create session
mining models from within Excel 2007. These models will be available only for that particular
user’s session—they won’t be saved onto the SSAS server instance. Click Finish to complete
the configuration.
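
To give you a sense of what that setting enables, here is a minimal DMX sketch of a session (temporary) mining model, the kind of object the Excel add-ins create on your behalf when temporary mining models are allowed. The model and column names are hypothetical, and the object exists only for the life of the connection; it is never saved to the SSAS database.

-- Requires that the server allow session mining models (the option set in the
-- Server Configuration Utility). The model is dropped when the connection closes.
CREATE SESSION MINING MODEL [Temp Bike Buyer Analysis] (
    [Row ID]     LONG KEY,
    [Age]        LONG CONTINUOUS,
    [Region]     TEXT DISCRETE,
    [Bike Buyer] LONG DISCRETE PREDICT
) USING Microsoft_Decision_Trees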

Next open the included Excel sample workbook located at C:\Program Files\Microsoft SQL
Server 2008 DM Add-Ins. It’s called DMAddins_SampleData.xlsx. After successfully installing
the add-ins, you’ll notice that a Data Mining tab has been added to the Excel Ribbon, as
shown in Figure 2-13.

Figure 2-13 The Data Mining tab on the Excel 2007 Ribbon

As we did when discussing the OLAP cubes using Excel 2007’s PivotTable view, we encourage
you to explore the data mining samples in depth. Let’s review what you’re looking at here.
For starters, if you click on the sample workbook page called Associate, you can quickly and
easily build a temporary mining model to give you an idea of what the end user’s experience
could be like. To do this, click on any cell that contains data on the Associate page, and then
click on the Associate button in the Data Modeling group shown in Figure 2-14.

This launches the Association Wizard. If you just accept all the defaults on the wizard, you’ll
build a permanent data mining model using one of the included Microsoft data mining algorithms.
A permanent model is one that is physically created on the SSAS server instance. You
also have the option to create temporary mining models, if you select that option on setup.
Temporary mining models are created in the Excel session only.

So what’s the best way to see exactly what you’ve done? And, more importantly, how do
you learn what can be done? We find that the best way is to use the included mining model
views, which (as we mentioned) mirror the views that are included in BIDS. Figure 2-15
shows one of the views for this particular model—the Dependency Network for Microsoft
Association. Did you notice the one small difference between this view in BIDS and in Excel?
In Excel, the view includes a small Copy To Excel button in its lower left corner.

Figure 2-14 Associate button in the Data Modeling group

Figure 2-15 Dependency Network view for Microsoft Association in Excel 2007

If you find yourself fascinated with the possibilities of the Data Mining tab on the Excel
Ribbon, you’re in good company! Exploring this functionality is not only fun, it’s also a
great way for you as a BI developer to really understand what data mining can do for your
company. We devote two entire chapters (Chapter 23, “Using Microsoft Excel 2007 as an
OLAP Cube Client,” and Chapter 24, “Microsoft Office 2007 as a Data Mining Client”) to covering
in detail all the functionality available when using Excel as both an OLAP cube and data
mining model client.

Building a Sample with Your Own Data


A common result after running through the initial exercise of viewing the included samples is
for you to immediately want to get started building a prototype cube or mining model with
your own data. This is a good thing! Very shortly (in Chapters 5 and 6), we’ll explore how to
build physical and logical cubes and mining models so that you can do just that.

You’ll be pleasantly surprised at how quickly and easily you can build prototypes—SSAS has
many features included to support this type of building. Remember, however, that quick prototype
building is just that—prototyping. We’ve seen quite a few quick prototypes deployed
to production—with very bad results. We spend quite a bit of time in later chapters differentiating
between quick modeling and building production structures so that you can avoid the
costly mistake of treating prototypes as production-ready models.

You might be able to build a quick cube or mining model in a day or less, depending on the
complexity and quality of your source data. It’s quite a bit of fun to do this and then show off
what would be possible if your organization were to envision, design, build, test, and deploy
production cubes and mining models.

Now that we’ve taken a look at what an end user sees, let’s return to our discussion of why
you’d want to use SQL Server 2008 BI solutions. Specifically, we’ll look at component implementation
considerations and return on investment (ROI) considerations. Then we’ll finish
with a summary that we’ve found useful when we’ve made the pitch for BI to clients.

Elements of a Complete BI Solution


In the world of BI solutions, a complete solution consists of much more than the OLAP cubes
and data mining structures built using SSAS. We would go so far as to say that to the ultimate
consumers of the BI project—the various end-user communities—a well-designed solution
should have data store sources that are nearly invisible. This allows those users to work in
a familiar and natural way with their enterprise data. This means, of course, that the selection
of client tools, or reporting tools, is absolutely critical to the successful adoption of the
results of your BI project.

There are other considerations as well, such as data load, preparation, and more. Let’s talk a
bit more about reporting first.

Reporting—Deciding Who Will Use the Solution


A key aspect when thinking about end-user client tools (or reporting interfaces) for SSAS
OLAP cubes and data mining structures is to review and determine what types of audiences
you propose to include in your solution. For example, you might select a more sophisticated
reporting tool for a dedicated segment of your staff, such as financial analysts, while you
might choose a simpler interface for another segment, such as help desk analysts. We’ve
found that if a dedicated BI solution is entirely new to your enterprise, it’s important to focus
on simplicity and appropriateness for particular end-user audiences.

Tip We’ve had good success profiling our end-user types during the early phases of our project.
To do this, we interview representative users from each target group, as well as their supervisors.
We also take a look at what type of tools these users work with already so that we can get a
sense of the type of environment they’re comfortable working in. We document these end-user
types and have subject matter experts validate our findings. We then propose reporting solutions
that are tailored to each end-user group type—that is, we implement an Excel PivotTable
view for district managers, implement Microsoft Office SharePoint Server 2007 dashboards for
regional directors, and so on.

Because of the importance of developing appropriate client interfaces, we’ve devoted an
entire section of this book to a comprehensive discussion of reporting clients. This discussion
includes looking at using Excel, Office SharePoint Server 2007, and PerformancePoint Server.
It also includes examining the concerns about implementing or developing custom clients,
such as Windows Forms or Web Forms applications. It’s our experience that most enterprise
BI solutions use a selection of reporting interfaces. Usually, there are two or more different
types of interfaces. We’ve even worked with some clients for which there was a business need
for more than five different types of client reporting interfaces.

Just to give you a taste of what is to come in our discussion about client reporting interfaces,
we’ve included an architectural diagram from SQL Server Books Online that details connecting
to an OLAP cube or data mining structure via a browser (a thin-client scenario). This is just
one of many possible scenarios. Notice in Figure 2-16 that there are many choices for how to
establish this type of connection. This diagram is meant to get you thinking about the many
possibilities of client reporting interfaces for BI solutions.

[Figure 2-16 depicts browsers and other thin clients connecting through Internet Information Services (IIS), where ASP and ASP.NET applications for OLAP and data mining (along with Win32, COM-based, and .NET client applications) use OLE DB for OLAP, ADO MD, or ADO MD.NET to reach an instance of SQL Server 2008 Analysis Services via XMLA over TCP/IP.]

Figure 2-16 Client architecture for IIS from SQL Server Books Online

ETL—Getting the Solution Implemented


One often underestimated concern in a BI project is the effort needed to consolidate, validate,
and clean all the source data that will be used in the SSAS OLAP cubes and data mining
structures. Of course, it’s not an absolute requirement to use SSIS for ETL (extract, transform,
and load) processes. However, in our experience, not only have we used this powerful tool for
100 percent of the solutions we’ve designed, but we’ve also used it extensively in those solutions.
As mentioned in Chapter 1, often 50 to 75 percent of the time spent on initial project
implementation can revolve around the ETL planning and implementation.

Because of the importance of SSIS in the initial load phase of your BI project, as well as in the
ongoing maintenance of OLAP cubes and data mining structures, we’ve devoted an entire
section of this book to understanding and using SSIS effectively. It has been our experience
that a general lack of understanding of SSIS combined with a tendency for database administrators
(DBAs) and developers to underestimate the dirtiness (or data quality and complexity)
of the source data has led to many a BI project delay.

Tip For many of our projects, we elected to hire one or more SSIS experts to quickly perform
the heavy lifting in projects that included either many data sources, data that was particularly
dirty, or both. We have found this to be money well spent, and it has helped us to deliver projects
on time and on budget.

In our entire section (six chapters) on SSIS, we guide you through tool usage, as well as provide
you with many best practices and lessons learned from our production experience with
this very important and powerful tool.

Data Mining—Don’t Leave It Out


As we saw by taking a closer look at the included samples earlier in this chapter, data mining
functionality is a core part of Microsoft’s BI offering, and you should include it in the
majority of your BI solutions. This might require that you educate the development team—
from BDMs to developers. We recommend bringing in outside experts to assist in this educational
process for some BI projects. All internal parties will significantly benefit if they have
the opportunity to review reference implementations that are verticals (or industry-specific
implementations) if at all possible. Microsoft is also working to provide samples to meet this
need for reference information; to that end, you should keep your eye on its case study Web
site for more case studies that deal with data mining solutions.

We have not seen broad implementation of data mining in production BI solutions built
using SQL Server 2005 or 2008. We attribute this to a lack of understanding of the business
value of this tool set. Generally, you can think of appropriate use of data mining as being a
proactive use of your organization’s data—that is, allowing SSAS to discover patterns and
trends in your data, and to predict important values for you. You can then act on those predictions
in a proactive manner. This use is in contrast to the typical use of OLAP cubes, which
is decision support. Another way to understand decision support is to think of cubes as being
used to validate a hypothesis, whereas mining structures are used in situations where you
have data but have not yet formed a hypothesis to test.
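
As a small, hypothetical illustration of that predictive use, the following DMX singleton prediction query asks a trained decision-tree model (here we reuse the illustrative model name from the earlier sketch) for a prediction about a brand-new case supplied directly in the query, along with the probability the model attaches to that prediction.

-- Predict the target value, and the model's confidence in it, for one new case
-- that is supplied inline rather than read from a table.
SELECT
    Predict([Bike Buyer])            AS [Predicted Bike Buyer],
    PredictProbability([Bike Buyer]) AS [Probability]
FROM [Mailing Decision Tree]
NATURAL PREDICTION JOIN
(SELECT 34 AS [Age], '2-5 Miles' AS [Commute Distance]) AS t

In a production solution, the same query shape is typically run against a whole table of new cases (using OPENQUERY with a non-natural PREDICTION JOIN), which is what lets you act on predictions proactively rather than one row at a time.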

We’ve made a conscious effort to include data mining throughout this book—with subjects
such as physical modeling, logical modeling, design, development phases, and so on—rather
than to give it the common treatment that most BI books use, which is to devote a chapter to
it at the end of the book. We feel that the data mining functionality is a core part of the SSAS
offering and exploring (and using) it is part of the majority of BI solutions built using SQL
Server 2008.

Common Business Challenges and BI Solutions


Let’s summarize business challenges and BI solution strengths and then discuss translating
these abilities into ROI for your company. You might want to refer back to the Top 10
Questions List earlier in this chapter, as you are now beginning to get enough information to
pull the answers to those questions together with the capabilities of the BI product
suite. As you examine business needs and product capabilities, you’ll move toward envisioning
the scope of your particular implementation. Here are some ways to meet challenges you
might face as you envision and develop BI solutions:

■■ Slow-to-execute queries Use OLAP cubes rather than OLTP (normalized) data sources
as sources for reports. OLAP cubes built using SQL Server Analysis Services are optimized
for read-only queries and can be 1000 percent faster in returning query results
than OLTP databases. This performance improvement comes from the efficiency of the
SSAS engine and storage mechanisms. This is a particularly effective solution if a large
amount of data aggregation (or consolidation) is required.
■■ General OLTP source system slowdowns Query against OLAP cubes or data mining
models rather than original OLTP databases. This approach greatly reduces locking
overhead from OLTP source systems. (OLAP systems don’t use locks except during processing.)
Also, OLAP cubes remove overhead from OLTP production source systems by
moving the reporting mechanism to a different data store and query engine. In other
words, the Analysis Services engine is processing reporting queries, so the query processing
load on the OLTP source systems is reduced.
■■ Manual query writing Allow end users to click to query (such as by using click and
drag on pivot tables or by using other types of end-user client tools). Providing this
functionality eliminates the wait time associated with traditional OLTP reporting.
Typically, new or custom queries against OLTP databases require that end users request
the particular reports, which then results in developers needing to manually write queries
against source OLTP systems. An example is the need to manually write complex
Transact-SQL queries if an RDBMS relational data source were being queried. Also,
these Transact-SQL queries often need to be manually tuned by developers or administrators
because of the processing load they might add to the production data stores.
This tuning can involve query rewriting to improve performance and also can involve
index creation, which adds overhead to the source system and can take significant
administrative effort to implement and maintain.
■■ Disparate data sources Combine data into central repositories (OLAP cubes) using
ETL packages created with SQL Server Integration Services. These packages can be
automated to run on a regular basis. Prior to implementing BI, we’ve often seen end
users, particularly analysts, spending large amounts of time manually combining
information. Analysis Services cubes and mining structures used in combination with
Integration Services packages can automate these processes.
■■ Invalid or inconsistent report data Create ETL packages via SSIS to clean and validate
data (prior to loading cubes or mining structures). Cubes provide a consistent
and unified view of data across the enterprise. As mentioned, we’ve often
noted that a large amount of knowledge workers’ time is spent finding and then manually
cleansing disparate or abnormal data prior to the implementation of a dedicated
BI solution. Inconsistent results can be embarrassing at the least and sometimes even
costly to businesses, as these types of issues can result in product or service quality
problems. We’ve even seen legal action taken as a result of incorrect data use in business
situations.
■■ Data is not available to all users BI repositories—OLAP cubes and data mining
structures—are designed to be accessed by all business users. Unlike many other vendors’
BI products, Microsoft has integrated BI repository support into many of its end-user
products. These include Microsoft Office Word, Excel, and Visio 2007; Office SharePoint
Server 2007 (via the Report Center template); and many others. This inclusion extends
the reach of a BI solution to more users in your business. It’s exciting to consider how
effective BI project implementation can make more data available to more of your
employees.
■■ Too much data Data mining is particularly suited to addressing this business issue,
as the included algorithms automatically find patterns in huge amounts of data. SSAS
data mining contains nine types of data mining algorithms that group (or cluster), and
(optionally) correlate and predict data values. It’s most common to implement data
mining when your company has data that is not currently being used for business analysis
because of sheer volume, complexity, or both.
■■ Lack of common top-level metrics Key performance indicators (KPIs) are particularly
effective in helping you define the most important metrics for your particular business.
SSAS OLAP cubes support the definition and inclusion of KPIs via wizards that generate
the MDX code. MDX code can also be written manually to create KPIs. Across the
BI suite of tools and products, the notion of KPIs is supported. This is because it’s a
common requirement to have a dashboard-style view of the most important business
metrics.

Now let’s translate these common business problems into a general statement about the
capabilities of BI and tie those capabilities to ROI for businesses that implement BI solutions.

Measuring the ROI of BI Solutions


BI is comprehensive and flexible. A single, correctly designed cube can actually contain all
of an organization’s data, and importantly, this cube will present that data to end users
consistently.

The ability to be comprehensive is best expressed by the various enhancements in SSAS
OLAP cubes related to scalability. These include storage and backup compression, query
optimization, and many more. These days, it’s common to see multiterabyte cubes in production.
This will result in reduced data storage and maintenance costs, as well as more accurate
and timely business information. It will also result in improved information worker productivity,
as end users spend less time getting the right or needed information.

To better understand the concept of flexibility, think about the Adventure Works sample
OLAP cube as displayed using the Excel PivotTable view. One example of flexibility in this
sample is that multiple types of measures (both Internet and Retail Sales) have been combined
into one structure. Most dimensions apply to both groups of measures, but not all do.
For example, there is no relationship between the Employee dimensions and any of the measures
in the Internet Sales group—because there are no employees involved in these types of
sales.

Cube modeling is now flexible enough to allow you to reflect business reality in a single
cube. In previous versions of SSAS, and in other vendors’ products, you would’ve been forced
to make compromises—creating multiple cubes or being limited by structural requirements.
This lack of flexibility in the past often translated into limitations and complexity in the client
tools as well. The enhanced flexibility in Microsoft BI applications will result in improved
ROI from reporting systems being more agile because of the “click to query” model. Rather
than requesting a new type of report—sending the report request to an administrator for
approval and then to a developer to code the database query and possibly to a database
administrator to tune the query—by using OLAP, the end user can instantly perform a drag
and drop operation to query or view the information in whatever format is most useful.

BI is accessible (that is, intuitive for all end users to view and manipulate). To better understand
this aspect of BI, we suggest that you try demonstrating the pivot table based on the
SSAS sample cube to others in your organization. They will usually quickly understand and be
quite impressed (some will even get excited!) as they begin to see the potential reach for BI
solutions in your company.

Pivot table interfaces reflect the way many users think about data—that is, “What are the
measures (or numbers), and what attributes (or factors) created these numbers?”

Some users might request a simpler interface than a pivot table (that is, a type of canned or
prefab report). Microsoft provides client tools—for example, SSRS—that facilitate that type
of implementation. It’s important for you to balance the benefits of honoring this type of
request, which entails manual report writing by you, against the benefits available to end
users who can use pivot tables. It has been our experience that most BI solutions include a
pivot table training component for end users who haven’t worked much with pivot tables.
This training results in improved ROI because more information will be useful for more end
users, and that will result in better decision making at your business.

BI is fast to query. After the initial setup is done, queries can easily run 1000 percent faster in
an OLAP database than in an OLTP database. Your sample won’t necessarily demonstrate the
speed of the query itself. However, it’s helpful to understand that the SSAS server is highly
optimized to provide a query experience that is far superior to, say, a typical relational database.
It’s superior because the SSAS engine itself is designed to quickly fetch or calculate
aggregated values. This will result in improved ROI because end users will spend less time
waiting for reports to process. We’ll dive into the details of this topic in Chapters 20, 21,
and 22.

BI is simple to query. End users simply drag items into and around the PivotTable area, and
developers write very little query code manually. It’s important to understand that SSAS
clients (such as Excel) automatically generate MDX queries when users drag and drop dimensions
and measures onto the designer surfaces. This is a tremendous advantage compared
to traditional OLTP reporting solutions, where Transact-SQL developers must manually write
all the queries. This simplicity will result in improved ROI because of the ability to execute
dynamic reporting and the increased agility on the part of the end users. They will be better
able to ask the right questions of the data in a timely way as market conditions change.

BI provides accurate, near real-time, summarized information. This functionality will improve
the quality of business decisions. Also, with some of the new features available in SSAS, particularly
proactive caching, cubes can have latency of only minutes or even
seconds. This will result in improved ROI because the usefulness of the results will improve
as they are delivered in a more timely way. We’ll discuss configuring real-time cubes in
Chapter 9.

Also, by drilling down into the information, users who need to see the detail—that is, the
numbers behind the numbers—can do so. The ability to drill down is, of course, implemented
in pivot tables via the simple “+” interface that is available for all (summed) aggregations in
the Adventure Works sample cube. This drill-down functionality results in improved ROI by
making results more actionable and by enabling users to quickly get the level of detail they
need to make decisions and to take action.

BI includes data mining. Data mining allows you to turn huge amounts of information into
actionable knowledge by applying the included data mining algorithms. These can group
(or cluster) related information together. Some of the algorithms can group and predict one
or more values in the data that you’re examining. This will result in improved ROI because
end users are presented with patterns, groupings, and predictions that they might not have
anticipated, which will enable them to make better decisions faster.
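
If you are curious what those discovered patterns look like outside of a graphical viewer, a simple DMX content query (sketched below against the illustrative clustering model named earlier; the column list is a subset of what the content schema exposes) returns the nodes the algorithm found. The BIDS viewers and the Excel add-ins render essentially this same content graphically.

-- List the groups (nodes) the algorithm discovered, with their captions,
-- case counts, and human-readable descriptions.
SELECT FLATTENED
    NODE_CAPTION,
    NODE_SUPPORT,
    NODE_DESCRIPTION
FROM [Mailing Clusters].CONTENT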

For the many reasons just mentioned, BI solutions built using SQL Server 2008, if implemented
intelligently, will result in significant ROI gains for your company. Most companies
have all the information they need—the core problem is that the information is not accessible
in formats that are useful for the people in those companies to use as a basis for decision
making in a timely way.

It’s really just that straightforward: OLAP and data mining solutions simply give businesses
a significant competitive advantage by making more data available to more end users so
that those users can make better decisions in a more timely way. What’s so exciting about
BI is that Microsoft has made it possible for many companies that couldn’t previously afford
to implement any type of BI solution to participate in this space. Microsoft has done this by
including with SQL Server 2008 all the core BI tools and technologies needed to implement
cubes. Although it’s possible to implement both SQL Server and SSAS on the same physical
server (and this is a common approach for development environments), in production situations
we generally see at least one physical server (if not more) dedicated to SSAS. Also, it’s
important to understand which BI features require the Enterprise edition of SQL Server or
SSAS. We’ll review feature differences by edition in detail throughout this book.

In addition to broadening BI’s reach by including some BI features in both the Standard and
Enterprise editions of SQL Server, Microsoft is also providing some much-needed competition
at the enterprise level. They have done this by including some extremely powerful
BI features in the Enterprise editions of SQL Server and SSAS. We’ll talk more about these
features as they apply to the specific components—that is, SSAS, SSIS, and SSRS—in the
respective sections where we drill down into the implementation details of each of those
components.

Summary
In this chapter, we took a closer look at OLAP and data mining concepts by considering the
most common business problems that the SQL Server 2008 BI toolset can alleviate and by
exploring the included samples contained in the AdventureWorksDW2008 database available
on the CodePlex Web site.

We next took a look at these samples from an end user’s perspective. We did this so that we
could experience OLAP cubes and data mining models using the commonly used client tools
available in Excel 2007. There are, of course, many other options in terms of client interfaces.
We’ll explore many of the clients in Part III, “Microsoft SQL Server 2008 Integration Services
for Developers.” You should carefully consider the various client interfaces at the beginning
of your BI project. Acceptance and usage depend on the suitability of these tools to the
particular end-user group types.

We completed our discussion by recapping common ROI areas associated with BI projects.
Securing executive stakeholder support (and, of course, funding!) is critical to the success of
all BI projects.

In the next chapter, we’ll take a look at the softer side of BI projects. This includes practical
advice regarding software development life cycle methodologies that we have found to
work (and not to work). Also, we’ll discuss the composition of your BI project team and the
skills needed on the team. We’ve seen many projects that were delayed or even completely
derailed because of a lack of attention to these soft areas, so we’ll pass along our tips from
the real world in the next chapter.
Chapter 3
Building Effective Business Intelligence Processes
You might be wondering why we’re devoting an entire chapter to what is referred to as the
softer side of business intelligence (BI) project implementations—the business processes and
staffing issues at the heart of any BI solution. Most BI projects are complex, involving many
people, numerous processes, and a lot of data. To ignore the potential process and people
challenges inherent in BI projects is to risk jeopardizing your entire project. A lack of understanding
and planning around these issues can lead to delays, cost overruns, and even project
cancellation.

In this chapter, we share lessons we’ve learned and best practices we follow when dealing
with the process and people issues that inevitably crop up in BI projects. We’ve had many
years of real-world experience implementing BI projects and have found that using known
and proven processes as you envision, plan, build, stabilize, and deploy your projects reduces
their complexity and lessens your overall risk. We start by explaining the standard software
development life cycle for business intelligence, including some of the formal models you
can use to implement it: Microsoft Solutions Framework (MSF) and MSF for Agile Software
Development. We then examine what it takes to build an effective project team: the skills
various team members need and the options you have for organizing the team.

Project leaders might need to educate team members so that they understand that these
processes aren’t recommended just to add bureaucratic overhead to an already complex
undertaking. In fact, our guiding principle is “as simple as is practical.” We all want to deliver
solutions that are per specification, on time, and on budget. The processes we describe in this
chapter are the ones we use on every BI project to ensure that we can deliver consistently
excellent results.

Software Development Life Cycle for BI Projects


BI projects usually involve building one or more back-end data stores (OLAP cubes, data
mining models, or both). These cubes and mining models often have to be designed from
scratch, which can be especially difficult when these types of data stores are new to the
developers and to the enterprise. Next, appropriate user interfaces must be selected, configured,
and sometimes developed. In addition to that, concerns about access levels (security in
general), auditing, performance, scalability, and availability further complicate implementation.
Also, the process of locating and validating disparate source data and combining it into
the new data models is fraught with complexity. Last but not least, for many organizations,
the BI toolset—SQL Server Analysis Services (SSAS), SQL Server Integration Services (SSIS),
and SQL Server Reporting Services (SSRS)—is new and has to be learned and mastered. So
how can you reduce the complexity of your project and improve the chances it will be delivered
per the specification, on time, and on budget? Our answer is simple: follow—or use as a
framework—a proven software development life cycle model, such as the Microsoft Solutions
Framework or its Agile version.

Note What if you don’t want to use MSF? We offer MSF (in both its classic and newer Agile
forms) as a sample software development life cycle for BI projects. It is a method that has worked
for us, in the sense that the framework proved flexible enough to be useful, yet also structured
enough to add value in terms of predictability and, ultimately, the on-time and on-budget delivery
of results. If you have a different software development life cycle that you and your team are
more comfortable with, by all means, use it. We simply want to emphasize that, given the complexity
and scale of most BI projects, we feel it’s quite important to use some sort of structured
process to improve the outcome of your project.

Microsoft Solutions Framework


Microsoft Solutions Framework (MSF) is a flexible software development life cycle that we’ve
successfully applied to various BI projects. It consists of known phases and milestones, which
are characteristic of the waterfall method, but it also has the iterations (or versions) found in
a spiral method. The combination of structure and flexibility in the MSF makes it well-suited
to BI projects. Such projects are usually mission critical, so milestones are expected, but they
are often iterative as well because of the scope of changes required as data is discovered and
cleaned in the extract, transform, and load (ETL) processes. Figure 3-1 shows the MSF software
development life cycle model.

Tip Another aspect of BI projects is the level of stakeholders’ knowledge about what can be
built with the BI suite. We often find that as stakeholders understand the applications
of BI to their particular business, they will increase the scope of the project. One example of this
is the decision to add data mining to an existing BI project based on a demonstration of a pilot
project. Another example of iteration is a request for a new OLAP cube—often the second cube
we build is for the finance department, with the first one usually being for the sales or marketing
department.

As shown in Figure 3-1, MSF has distinct project phases (envision, plan, build, stabilize, and
deploy) and project milestones (vision/scope approved, project plans approved, scope complete,
release readiness approved, and deployment complete). MSF further advocates for
particular roles and responsibilities in the software development life cycle, a topic we cover in
detail later in this chapter.

[Figure 3-1 depicts the MSF cycle for Release 1: the envision, plan, build, stabilize, and deploy phases arranged in a loop, with the vision/scope approved, project plans approved, scope complete, release readiness approved, and deployment complete milestones marking the transitions between phases.]

Figure 3-1 Phases and milestones in Microsoft Solutions Framework

For more information about MSF in general, go to http://www.microsoft.com/technet/
solutionaccelerators/msf/default.mspx. This site includes detailed explanations about the MSF
process phases, deliverables, roles, and more. It also includes case studies and sample deliverable
templates. All of this information is generic by design (one of the strengths of MSF), so
it can easily be adapted to any type of project. We’ll talk a bit about how we’ve done just that
in the next section.

Microsoft Solutions Framework for Agile Software Development
Microsoft Solutions Framework for Agile Software Development, or MSF version 4, is even
more generic and flexible (or agile) than the original version of MSF. From the MSF Agile process
guidance, here is the definition:

Microsoft Solutions Framework (MSF) for Agile Software Development is a
scenario-driven, context-based, agile software development process for building
.NET and other object-oriented applications. MSF for Agile Software Development
directly incorporates practices for handling quality of service requirements such
as performance and security. It is also context-based and uses a context-driven
approach to determine how to operate the project. This approach helps create an
adaptive process that overcomes the boundary conditions of most agile software
development processes while achieving the objectives set out in the vision of the
project.

The scenario-driven (or use-case driven) and context-based approaches of MSF Agile are particularly
suited to BI projects for three reasons:

■■ Because of the scoping challenges we mentioned in the preceding section


■■ Because of the need to provide the context-specific (or vertical-specific) approach
for the stakeholders so that they can see what BI can do for their particular type of
business
■■ Because of their inherent agility

Every BI project we’ve worked on has been iterative—that is, multiversioned—and the final
scope of this version of the project varied significantly from the original specification.

The spiral method can be particularly tricky to grasp when you’re beginning to work with
MSF Agile. This quote from the MSF will help you understand this complex but important
concept:

The smooth integration of MSF for Agile Software Development in Visual Studio
Team System supports rapid iterative development with continuous learning and
refinement. Product definition, development, and testing occur in overlapping
iterations resulting in incremental completion of the project. Different iterations
have different focus as the project approaches release. Small iterations allow you to
reduce the margin of error in your estimates and provide fast feedback about the
accuracy of your project plans. Each iteration should result in a stable portion of the
overall system.

Figure 3-2 shows how iterations combine with project-specific implementation phases—such
as setup, planning, and so on.

[Figure 3-2 depicts iteration 0 (project setup, planning, and initial develop-and-test work), iteration 1 (plan, develop and test, feedback), and further iterations repeated as needed, culminating in the release of the product.]

Figure 3-2 Cycles and iterations in MSF Agile

We have used a variant of the MSF software development life cycle, either MSF 3.0 (standard)
or MSF 4.0 (agile), for every BI project we’ve implemented. We firmly believe that MSF is a
solid, useful set of guidance that leads to more predictable results.

Note MSF Agile is built into Microsoft Visual Studio Team System. If your organization uses
Visual Studio Team System, you can use the free MSF Agile templates and guidance available
at http://www.microsoft.com/downloads/details.aspx?familyid=EA75784E-3A3F-48FB-824E-
828BF593C34D&displaylang=en. Using Visual Studio Team System makes it easier to address a
number of important project issues—such as code source control, work item assignments, and
monitoring—but it isn’t required if you choose to use MSF as a software development life cycle
methodology for your BI project.

Applying MSF to BI Projects


As you saw in Figure 3-1, MSF has five distinct project phases (envision, plan, build, stabilize,
and deploy) and five project milestones (vision/scope approved, project plans approved,
scope complete, release readiness approved, and deployment complete). Now let’s drill
down and apply MSF to specific situations you will encounter in BI projects. We again want
to remind you that the failure to adopt some kind of software development life cycle usually
has negative results for BI projects. Although in more classic types of software projects, such
as an ASP.NET Web site, you might be able to get away with development on the fly, with BI
projects you’re asking for trouble if you skip right to building. We’ve seen BI projects delayed
for months and halted and restarted midstream, pitfalls that could have been avoided if a
more formal software development life cycle model had been followed. To get started, we’ll
dive into the specific phases of the MSF as applied to BI projects.

Phases and Deliverables in the Microsoft Solutions Framework


To get started, we’ll walk through the phases and deliverables. Our approach refers to
both MSF for Capability Maturity Model Integration (CMMI) and MSF for Agile Software
Development. Generally, you can assume that the type of MSF software development life
cycle you use will vary depending on the type of organization you work in and its culture.
MSF can be more formal (that is, it can have phases and iterations plus more public milestones).
Another way to define this more formal implementation is as “using the CMMI methodology.”
Or MSF can be more informal (or agile, meaning fewer formal milestones and more
iterations). We’ll start by taking a deeper look at the general framework, that is, phases and
so on, then later in this chapter we’ll drill into applying this framework to BI projects.

Envisioning
The best guidance we can give related to BI-specific projects is this: at the envisioning phase,
your team should be thinking big and broad. For example, we suggest that you automatically
include data mining in every project. Another way to think of this approach is the “If in
doubt, leave it in” theory of design. In particular, make sure to include all source data and
plan for some type of access to some subset of data for everyone in the company. As we said
in Chapter 1, “Business Intelligence Basics,” the BI toolset in SQL Server 2008 was designed to
support this idea of BI for everyone. We’ve often seen fallout from the “complex data for the
analysts only” mindset, which unnecessarily limits the scope of a BI project. At a later point in
your BI project, your team will trim the big and broad ideas of this phase into an appropriate
scope for the particular iteration of your project.

During this phase, be sure to include at least one representative (preferably more) from all
possible user group types in the envisioning discussions. This group might include executives,
analysts, help desk employees, IT staff, front-line workers, and so on. Allotting enough time
and resources to accurately discover the current situation is also important. Make sure your
team finds out what reports don’t exist or aren’t used (or are thought impossible) because
of current data limitations. Another critical discovery step is to find and document the locations
of all possible source data, including relational data stores such as Microsoft SQL Server,
Oracle, and DB2; nonrelational data such as .CSV files from mainframes, XML, Microsoft
Office Excel, Microsoft Access; and so on.

Tip When you’re hunting for data, be sure to check people’s local machines. We’ve found many
a key Excel spreadsheet in a My Documents folder!

After you’ve gathered all the information, you can begin to prioritize the business problems
you’re trying to affect with this project and then match those problems to product features
that will solve them. (Refer to Chapter 1 and Chapter 2, “Visualizing Business Intelligence
Results,” for a detailed discussion of matching particular business problems to the BI features
of SQL Server 2008.) An example of this is a situation where your company has recently purchased
a competitor and you now have the challenge of quickly integrating a huge amount
of disparate data. Typical BI solutions for this type of problem include extensive use of SSIS to
clean, validate, and consolidate the data, followed by creating data mining structures to explore
the data for patterns, and sometimes also building an OLAP cube to make sense of the data
via pivot table reports against that cube data. We see quite a few mistakes made in the envisioning
phase. Most often they result from a simple lack of resources allocated for discovery
and planning. Avoid the “rush to build (or code)” mentality that plagues so many software
projects.

With that said, however, we do not mean that you should refrain from building anything
during envisioning. Quite the contrary, one of the great strengths of SSAS is that you can
easily create quick prototype OLAP cubes and data mining structures. We saw in Chapter 2
that a few clicks are all you need to enable the AdventureWorks samples. In some ways, the
developer user interface is almost too easy because inexperienced BI developers can easily
create quick data structure prototypes using the built-in GUI wizards and designers, but the
structures they build will be unacceptably inefficient when loaded with real-world quantities
of data. Our caution here is twofold:

■■ Developers of OLAP cubes, data mining models, SSIS packages, and SSRS reports can
and should create quick prototypes to help explain BI concepts to team members and
business decision makers (BDMs).
■■ Developers should understand that these prototypes should not be used in production
situations, because during the rapid prototyping process they are not optimized for
performance (scalability, availability), usability (securability, ease of use), and so on.

The goal of the envisioning phase is to move the team toward a common vision and scope
document. The level of detail and size of this document will vary depending on the size of
the organization and the overall complexity of the project. This document will be used to set
the tone for the project. It will help to keep the team focused on the key business goals, and
it will serve as a basis for agreement with the stakeholders. The document will also be useful
as a tool to trace business needs to specific product features in later phases of the software
development life cycle process.

Planning
The goal of the planning phase is to create as detailed a design plan as is possible and appropriate
for your particular BI project. Also, during this phase the development and test environments
should be set up.

Activities during this phase include the following:

■■ Selecting a modeling tool (or tools) that works for you and your team This tool
will be used to model the OLAP cubes and mining models. You can use a traditional
data modeling tool, such as Visio or ERwin for this process, or you can use SSAS itself.
We’ll discuss the latter option in detail in Chapter 5, “Logical OLAP Design Concepts for
Architects.”
■■ Documenting taxonomies as they are actually used (rather than accepting them
as they currently exist) This process is similar to creating a data dictionary. In other
words, the result is a “this-means-that” list, with notes and definitions included as
appropriate. You can use something as simple as an Excel spreadsheet, or you can use
a database or any other tool you’re comfortable with. Your taxonomy capture should
include the following information:
❏■ Natural language of the business Capture this information by conducting
interviews with representatives from the various role groups—that is, executives,
analysts, and so on. Your questions should take a format similar to this: “How do
you refer to x (that is, your customer, client, and so on) in your organization?”

❏■ Data source structure names This information includes translating database
table and column names into a common taxonomy. It also includes translating
unstructured metadata, such as XML element or attribute names.

■■ Capturing the current state of all source data This process includes capturing server
names, server locations, IP addresses, and database names. It also includes capturing
accessibility windows for source data. What this means is identifying the times when
source data can be extracted on a regular basis. This information should include the
frequency, load windows (or window of time for extracting data), and credentials to be
used to access the source data. It can also include a discussion of the security require-
ments for the data, data obfuscation or encryption, and possibly auditing of data
access. A common mistake that you should avoid is focusing too narrowly when completing
this step. We’ve seen many companies simply document a single relational database
and insist that it was the sum total of the important data, only later to return with
data from a large (and disparate) number of sources. These late additions have included
Excel workbooks, Access databases, mainframe data, Windows Communication
Foundation (WCF) or Web service data, XML data, and more.

Prototypes built during this phase usually include sample OLAP cubes and data mining models,
and they also often include report mockups. These prototypes can be hand drawn or
mocked up using whatever easy-to-use software you have available. We’ve used Microsoft
Office PowerPoint, Excel, Word, and SSRS itself to create quick report mockups. You can
also choose to create report mockups using Microsoft Office SharePoint Server 2007 or
PerformancePoint Server.

The result of the planning phase is a specification document. This document should include
designs for the OLAP cubes, data mining models, or both. It should also include some kind
of data map to be used in the development of SSIS packages, report environments (that
is, SSRS, Office SharePoint Server, a custom application), and report requirements. This
document also should contain a list of resources—people, servers, and software. From this
resource list, a budget and work schedule can be drawn up. This document also can serve as
a contract if the BI team is external to the company.

Building
In the building phase of a BI project, there is quite a bit of complexity to managing the
schedule. It’s common that multiple developers will be involved because it’s rare for a single
developer to understand all the technologies involved in BI projects—OLAP, data mining,
SSIS, SSRS, SharePoint, and so on. Data structures can be built in one of two ways—either by
starting with them empty and then filling them with data from source structures, or by building
and loading them at the same time. We’ve found that either approach will work. The one
you choose for your project will depend on the number of people you have working on the
project and the dirtiness (that is, the amount of invalid types, lengths, characters, and so on) of
the source data.

A common mistake is to underestimate the amount of resources that need to be allocated to
cleaning, validating, and consolidating source data for the initial cube and mining structure
loads. It’s often 50 to 75 percent of the project time and cost for the initial setup when doing
this in the SSIS package creation phase. Although this might seem like a prohibitive cost, consider
the alternative—basing key business decisions on incorrect data!

Figure 3-3 shows the software development life cycle for data mining solutions. You can
envision a similar software development life cycle for building OLAP cubes. The process
arrows show the iterative nature of preparing and exploring data and then building and validating
models. This illustration captures a key challenge in implementing BI solutions—that
is, you can’t proceed until you know what you have (in the source data), and you can’t completely
understand the source data all at once. There’s simply too much of it. So SSIS is used
to progressively clean the data, and then, in this case, data mining is used to understand the
cleansed data. This process is repeated until you get a meaningful result. Data mining model
validation is built into the product so that you can more easily access the results of your
model building.

[Figure 3-3 depicts an iterative loop: defining the problem; preparing and exploring data (using Integration Services and a data source view); building and validating models (in the data mining designer); and deploying and updating models.]

Figure 3-3 Software development life cycle for BI data mining projects (from SQL Server Books Online)

This process results in SSIS developers and SSAS developers needing to work together very
closely. One analogy is the traditional concept of pair programming. We’ve found that
tools such as instant messaging and Live Meeting are quite helpful in this type of working
environment.

Another variation of the traditional developer role in the world of SQL Server 2008 BI is that
the majority of the development you perform takes place in a GUI environment, rather than
in a code-writing window. Although many languages, as previously presented, are involved
under the hood in BI, the most effective BI developers are masters of their particular GUI.
For SSAS, that means thoroughly understanding the OLAP cube and data mining structure
designers in BIDS. For this reason, we devote several future chapters to this rich and complex
integrated development environment. For SSIS, this means mastering the SSIS package
designer in BIDS. We devote an entire section of this book to exploring every nook and
cranny of that rich environment. For SSRS designers, this means mastering the SSRS designer
in BIDS. Also, other report-hosting environments can be used, including Office SharePoint
Server 2007, PerformancePoint Server, custom Windows Forms, Web Forms, or mobile forms,
and more.

For maximum productivity during the development phase, consider these tips:

■■ Ensure great communication between the developer leads for SSIS, SSAS, and SSRS. It’s
common to have quick, daily recaps at a minimum. These can be as short as 15 minutes.
If developers are geographically dispersed, encourage use of IM tools.
■■ Validate developers’ skills with BIDS prior to beginning the development phase.
Expecting .NET developers to simply pick up the interface to BIDS is not realistic. Also,
traditional coders tend to try to use more manual code than is needed, and this can
slow project progress.
■■ Establish a daily communication cycle, which optimally includes a daily build.
■■ Establish source control mechanisms, tools, and vehicles.

Note Pay attention to security requirements during the build phase. A common mistake we’ve
seen is that BI developers work with some set of production data—sometimes a complete copy—
with little or no security in place in the development environment. Best practice is to follow
production security requirements during all phases of the BI project. If this includes obscuring,
encrypting, or securing the data in production, those processes should be in place at the
beginning of the development phase of your BI project. SQL Server 2008 contains some great
features, such as transparent encryption, that can make compliance to security requirements
much less arduous. For more information about native SQL Server 2008 data encryptions go to
http://edge.technet.com/Media/580.

The goal of the build phase is to produce OLAP cubes and data mining structures that contain
production data (often only a subset) per the specification document produced in the
planning phase. Also, reporting structures—which can include SSRS, Office SharePoint
Server 2007, and more—should be complete and available for testing.

Stabilizing
We’ve seen many BI project results rushed into production with little or no testing. This is a
mistake you must avoid! As mentioned earlier, by using the wizards in BIDS, you can create
OLAP cubes, data mining structures, SSIS packages, and SSRS reports very quickly. This is
fantastic for prototyping and iterative design; however, we’ve seen it have disastrous results
if these quick structures are deployed into production without adequate testing (and sub-
sequent tuning by the developers). This tuning most often involves skillful use of advanced
properties in the various BIDS designers, but it can also include manual code tweaking and
other steps. We’ll devote several future chapters to understanding these fine but important
details.

Suffice it to say at this point that your testing plan should take into account the following
considerations:

■■ It’s critical to test your cubes and mining models with predicted, production-level loads
(from at least the first year). One of the most common mistakes we’ve seen is develop-
ing these structures with very small amounts of data and doing no production-level
testing. We call this the exploding problem of data warehousing.
■■ Your plan should include testing for usability (which should include end-user documen-
tation testing). Usability testing should include query response time for cubes and min-
ing models with production-level data loads put into the cubes and mining models.
■■ You should also test security for all access levels. If your requirements include security
auditing, that should also be tested.

The goal of the testing phase is to gain approval for deployment into production from all
stakeholders in the project and to obtain sign-off from those who will be conducting the
deployment. This sign-off certifies that specification goals have been met or exceeded during
stabilizing and that the production environment has been prepared and is ready to be used.

Deploying
As your solution is moved into production, there should be a plan to create realistic service
level agreements (SLAs) for security, performance (response time), and availability. Also, a
plan for archiving data on a rolling basis should be implemented. SQL Server 2008 has many
features that make archiving easy to implement. We’ll talk more about archiving in Chapter 9,
“Processing Cubes and Dimensions.”

Deployment includes monitoring for compliance with SLA terms over time. The most com-
mon challenge we’ve encountered in the deployment phase is that network administrators
are unfamiliar with BI artifacts such as cubes, mining models, packages, and so on. The most
effective way to mitigate this is to assess the knowledge and skills of this group prior to
deployment and to provide appropriate training. Microsoft has a large number of resources
available for BI administrators at http://www.microsoft.com/sqlserver/2008/en/us/events-
webcasts.aspx#BusinessIntelligenceWebcast.

Skills Necessary for BI Projects


In this section, we discuss both the required and optional (but nice to have) skills the mem-
bers of your BI team need. Required skills include concepts such as understanding the use
of the BI development tools in SQL Server 2008—not only SSAS and SSRS, but also SSIS. This
section also includes a brief discussion of skills that relate to some of the new features of SQL
Server 2008. The reason we include this information is that you might choose to use SQL
Server 2008 as a data repository for source data based on your understanding of these new
features. Following the discussion of required skills, we cover optional skills for BI teams.

Required Skills
The following is a set of skills that we have found to be mandatory for nearly all of our real-
world BI projects. These skills are most typically spread across multiple people because of
the breadth and depth of knowledge required in each of the skill areas. As you are gathering
resources for your project, a critical decision point is how you choose to source resources
with these skills. Do you have them on staff already? This is unusual. Do you have developers
who are capable of and interested in learning these skills? How will you train them? Do you
prefer to hire outside consultants who have expertise in these skill areas? One of the reasons
we provide you with this information here is because if you want or need to hire outside con-
sultants for your project, you can use the information in this section as a template for estab-
lishing hiring criteria.

Building the Data Storage Containers


The first consideration is who will build the foundation (the data storage containers), which
can consist of OLAP cubes, data mining structures, or both. A required qualification for per-
forming either of these tasks is a high level of proficiency with the developer interfaces in
BIDS for SSAS (both OLAP cubes and data mining structures), as detailed in the following list:

■■ For OLAP cubes We find that understanding best practices for OLAP cube mod-
eling—that is, star schema (sometimes called dimensional modeling), dimensional
hierarchies, aggregation, and storage design—is needed here. We cover these topics
in Chapter 5. As mentioned earlier, also required is an understanding of the appropri-
ate use of BIDS to build SSAS OLAP cubes. Also needed is an understanding of how to
use SQL Server Management Studio (SSMS) to manage cubes. We also find that a basic
understanding of MDX syntax and XMLA scripting are valuable.
■■ For data mining structures We find that understanding best practices for data
mining structure modeling—that is DM modeling concepts, including a basic under-
standing of the capabilities of the data mining algorithms, the functions of input, and
predictable columns—is needed for building a foundation using data mining structures.
It’s also helpful if the person in this role understands the function of nested tables in
modeling. We cover all these topics in future chapters in much more detail.

Creating the User Interface


The second consideration is related to the building of the user interface (UI). Sensitivity to UI
issues is important here, as is having a full understanding of the current client environment—
that is Excel, Office SharePoint Server 2007, and so on—so that the appropriate client UI
can be selected and implemented. This includes understanding Excel interface capabilities,
PivotTables, and mining model viewers. You should also understand Office SharePoint Server
2007 Report Center and other SharePoint objects such as Excel Services. Excel Services facili-
tates hosting Web-based workbooks via Office SharePoint Server 2007. As mentioned in the
life cycle section earlier in the chapter, sensitivity for and appropriate use of natural busi-
ness taxonomies in UI implementation can be significant in aiding end-user adoption. This is
particularly true for BI projects because the volumes of data available for presentation are so
massive. Here are considerations to keep in mind when creating the user interface:

■■ For reporting in SSRS The first requirement is that team members understand the
appropriate use of BIDS to author SSRS reports based on OLAP cubes and data min-
ing structures. This understanding should include an ability to use report query wiz-
ards, which generate MDX or DMX queries, as well as knowing how to use the report
designer surface to display and format reports properly. Usually report developers will
also need to have at least a basic knowledge of both query languages (MDX and DMX).
Also needed is a basic understanding of Report Definition Language (RDL). It’s also
common for report creators to use the redesigned Report Builder tool. We devote Part
IV, “Microsoft SQL Server Reporting Services and Other Client Interfaces for Business
Intelligence,” to the appropriate use of report designing tools and interfaces.
■■ For reporting in Excel It’s common to expect report designers to be able to imple-
ment Excel PivotTables for reports using OLAP cubes as data sources. Also required is
for report designers to use the Data Mining Add-ins for Excel to create reports that use
SSAS data mining models in Excel.

Understanding Extract, Transform, and Load Processes


A critical skill set for your BI developers to have is an understanding of the powerful SSIS
ETL package designer. As mentioned earlier, it’s not uncommon for up to 75 percent of the
time for the initial setup of a BI project to be around SSIS. This is because of the general
messy state of source data. It’s common to underestimate the magnitude and complexity of
this task, so the more skilled the SSIS developer is, the more rapidly the initial development
phases can progress. It has been our experience that hiring master SSIS developers if none
are present in-house is a good strategy for most new BI projects.

For data preparation or ETL, developers should fully understand the BIDS interface that is
used to build SSIS packages. This interface is deceptively simple looking. As mentioned pre-
viously, we recommend using experienced SSIS developers for most BI projects. Another
important skill is knowing how to use SSMS to manage SSIS packages.

Optimizing the Data Sources


The skills needed for this last topic vary greatly. They depend on the quantity and quality of
your data sources. For some projects, you’ll have only a single relational database as a source.
If this is your situation, you probably don’t have much to worry about, especially if you
already have a skilled database administrator (DBA). This person can optimize the data source
for extracting, transforming, and loading (ETL) into the SSAS data structures.

Although using SQL Server 2008 as a data source is actually not required, because you’ll
be using SQL Server 2008 BI tools, you might want to upgrade your data sources. There
are many new features built into SQL Server 2008, such as compression, encryption, easier
archiving, and much more. These new features make using SQL Server 2008 very attrac-
tive for serving up huge amounts of data. We’ll cover these new features in greater depth in
Chapter 9.

Note Is SQL Server 2008 required as a data source for SSAS? The answer is no. You can use
nearly any type of data source to feed data to SSAS. We mention SQL Server 2008 in this section
because you might elect to upgrade an earlier version of SQL Server or migrate from a different
vendor’s RDBMS at the beginning of your BI project. You might choose to upgrade or migrate
because of the availability, scalability, and security enhancements in SQL Server 2008.

Optional Skills
The following is a set of next-level skills we’ve found to be useful for a development team to
have for nearly all of our real-world BI projects. These skills, even more than the basic-level
skills, are most typically spread across multiple people because of the breadth and depth
of knowledge required in each of the skill areas. They’re not absolutely needed for every BI
project, and, in many cases, these are the types of skills you might contract out for during
various phases of your particular BI project.

Building the Foundation


For all types of foundational stores, advanced knowledge of the BIDS design interface for
OLAP cubes and data mining models is quite powerful. This includes knowing how to apply
advanced property configuration options, as follows:

■■ For OLAP cubes This skill set includes advanced use of BIDS to model SSAS OLAP
cubes and to configure advanced cube and mining model properties either via the GUI
design or via the object model (ADOMD.NET API). It also includes MDX native query
and expression authoring skills, as well as advanced XMLA scripting skills.
■■ For data mining structures As with the preceding item, this skill set includes
advanced configuration of data mining structures and modeling using BIDS. It also
includes DMX native query and expression authoring skills.

Creating the User Interface


Advanced skills for UI creation usually revolve around the addition of more types of UI cli-
ents. While most every project we’ve worked on has used Excel and SSRS, it’s becoming more
common to extend beyond these interfaces for reports, as summarized in the following list:

■■ For reporting in SSRS Advanced skills for SSRS include the use of Report Builder. (This
also usually involves the ability to train business analysts or other similarly skilled users
to use Report Builder to eventually create their own reports.) This skill set also includes
advanced knowledge of the SSRS interface in BIDS, as well as the ability to use the
advanced property configuration settings and perform manual generation of RDL.
■■ For reporting in Excel Advanced knowledge of Excel as an SSAS client includes
understanding the native query functionality of OLAP cubes via extensions to the Excel
query language. This skill set also includes a complete understanding of all the options
available on the Data Mining tab, particularly those related to mining model validation
and management from within Excel.
■■ For reporting in Office SharePoint Server 2007 Advanced OLAP skills in Office
SharePoint Server 2007 entail understanding the capabilities of the built-in Report
Center template and knowing the most effective ways to extend or create custom
dashboards. Also required is an understanding of what type of data source to use to
create dashboards with the greatest level of performance. Data sources to be used with
this skill set can include OLAP cubes and Business Data Catalog (BDC)–connected data.
■■ For reporting in PerformancePoint Server If PerformancePoint Server is used as a
client application for your BI solution, UI designers should be comfortable, at a mini-
mum, with creating custom dashboards. Additionally, PerformancePoint Server is often
used when financial forecasting is a requirement.
■■ For reporting in Microsoft Dynamics or other Customer Relationship Management
(CRM) projects If you plan to customize the OLAP cubes or reports that are imple-
mented by default in Microsoft Dynamics or other CRM products, you need to have a
deep understanding of the source schemas for those cubes. Also, you need to carefully
document any programmatic changes you make so that you can ensure a workable
upgrade path when both SQL Server and Microsoft Dynamics release new product ver-
sions, service packs, and so on.
■■ For custom client reporting .NET programming experience is preferred if custom cli-
ent development is part of your requirements. Skills needed include an understanding
of the embeddable data mining controls as well as linked (or embedded) SSRS that use
OLAP cubes as data sources.

Understanding Extract, Transform, and Load Processes


As mentioned several times previously, we’ve found the work involved in locating, prepar-
ing, cleaning, validating, and combining disparate source data to be quite substantial for nearly
every BI project we’ve been involved with. For this reason, advanced SSIS skills, more than
advanced skills for any other part of the BI stack (that is, SSAS, SSRS), are most critical to
ensuring that BI projects are implemented on time and on budget.

For data preparation or ETL, in addition to mastering all the complexities of the SSIS pack-
age creation interface in BIDS (which includes configuration of advanced control flow and
data flow task and package properties), advanced SSIS developers need to be comfortable
implementing SSIS scripting, SSIS debugging, and SSIS logging. They should also understand
the SSIS object model. In some cases, other products, such as BizTalk Server (orchestrations)
might also be used to automate and facilitate complex data movement.

Optimizing the Data Sources


Your advanced understanding of optimization of underlying data sources is completely
dependent on those particular stores. We’ll cover only one case here—SQL Server 2008.

BI developers who have advanced skills with SQL Server 2008 used as a BI data source will
understand query tuning and indexing, and they’ll be able to optimize both for the most effi-
cient initial data load or loads, as well as for incremental updates. They’ll also understand and
use new security features, such as transparent encryption to improve the security of sensitive
data. Additionally, they'll know about and understand the enhancements that help maintenance
tasks fit within limited maintenance windows; these enhancements can include backup
compression, partitioning, and other functionality.
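
To give one small, hedged example of these features (hypothetical database, table, and path
names; both features require Enterprise edition in SQL Server 2008), backup compression and
page-level data compression are each enabled with a single statement:

-- Compress a full backup so that it completes within a shorter maintenance window.
BACKUP DATABASE SalesDW
    TO DISK = N'G:\Backups\SalesDW_full.bak'
    WITH COMPRESSION, INIT;

-- Rebuild a large fact table with page compression to reduce storage and I/O.
ALTER TABLE dbo.FactSales
    REBUILD WITH (DATA_COMPRESSION = PAGE);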

Forming Your Team


With this very long and imposing skills list, you might be wondering just how you accomplish
all this. And if you are thinking that, we’ve proved our point—you can’t. Microsoft has worked
hard to make BI accessible; however, the reality remains that BI projects require certain skills
in order for them to be successful. It’s common to bring in specialized consultants during var-
ious phases of a project’s life cycle to complement the skills you and your team bring to the
table. Another common response is to include a training plan for the internal team as part of
a BI project.

Roles and Responsibilities Needed When Working with MSF


Once again, the flexible MSF guidance can be helpful as you assemble your team. MSF
includes the notion of roles and responsibilities, which are tied to milestones and deliver-
ables from the MSF process model. Figure 3-4 shows the guidance for roles. Don’t be misled
into thinking that MSF advocates for a rigid number of people for each project—that is, one
to fit each of the seven MSF roles shown in the diagram. Quite the opposite: the roles and
responsibilities can be combined to scale to teams as small as three people, and they can be
expanded to scale to teams with hundreds of members.

It is important to note the connections between the roles, particularly the central node,
which is labeled Team of Peers in the diagram in Figure 3-4. The important concept to con-
sider here is that although all team members will contribute in different ways to the project
using their various skill sets, each member of the team will make a critical contribution to the
project. If any one member cannot perform at a high level, the project has a significant risk of
not being delivered on time and on budget, and it might also not meet the quality bar estab-
lished at the start of the project.

Figure 3-4 The seven major cluster groups for roles in MSF Agile (Product Management, Program Management, Architecture, Development, Test, User Experience, and Release/Operations, arranged around a central Team of Peers node)

MSF Agile further associates roles with particular project responsibilities. Figure 3-4 applies
MSF Agile rather than classic MSF. Note that MSF Agile separates the Program Management
and Architecture roles. This option has worked well for us in BI projects. Usually, internal staff
manages the program (or project)—controlling the schedule and budget—and external con-
sultants (the authors of this book, for example) perform architecture tasks.

Downscaling is rather obvious: simply combine roles according to the skills of team members.
Upscaling requires a bit of explanation. MSF advocates for teams that report to leads for each
role; those teams are then responsible for particular feature sets of the solutions.

We’ll spend the rest of this chapter examining the basic definitions for MSF roles and relating
those definitions to typical BI responsibilities.

Product Management
The product manager initiates the project and communicates most project information to
the external sponsors. These sponsors are external to the production team, but they’re usu-
ally part of the company in which the solution is being implemented. We’ve found exceptions
to the latter statement in some larger enterprises, particularly government projects. These
large enterprises often have stricter reporting requirements and also request more detailed
envisioning and specification documents. In many organizations, the product manager is also
the business analyst and acts as the communication bridge between the IT project and busi-
ness teams.

All BI projects need a product manager. In addition to initiating the project, which includes
talking with the stakeholders to get their input before the project starts, the product man-
ager is also responsible for getting the problem statement and then working with the team
to formulate the vision/scope statement and the functional specification, budget estimates,
and timeline estimates. The product manager is also the person who presents the vision/
scope and functional specification documents (which include budget and schedule) to the
stakeholders for formal approval. The product manager is one of the primary external-facing
team members. The product manager communicates the status of the project to the stake-
holders and asks for more resources on the part of the team if the circumstances or scope of
the project changes. The following list summarizes the duties of the product manager:

■■ Acts as an advocate for the stakeholders


■■ Drives the shared project vision and scope
■■ Manages the customer requirements definition
■■ Develops and maintains the business case
■■ Manages customer expectations
■■ Drives features versus schedule versus resources tradeoff decisions
■■ Manages marketing, evangelizing, and public relations
■■ Develops, maintains, and executes the communications plan

Architecture
The duties of the BI project architect vary widely depending on the scope of the project. For
smaller projects (particularly those using Agile software development life cycle methodolo-
gies), it’s common that the person who designs the project architecture is the same person
who implements that architecture as a BI developer. Some of the skills that a qualified BI
architect should have are the ability to translate business requirements into appropriate star
schema and data mining structure models, the ability to model OLAP cubes (particularly
dimensions), and the ability to model data for inclusion in one or more data mining models.
For this last skill, an understanding of data mining algorithm requirements, such as supported
data types, is useful.

Also, BI architects should be comfortable using one or more methods of visually document-
ing their work. Of course, in the simplest scenarios, a whiteboard works just fine. For most
projects, our architects use a more formal data modeling tool, such as Visio or ERwin.

Program Management
In some implementations of MSF, the program manager is also the project architect. The
Program Manager role is responsible for keeping the project on time and on budget.
In smaller projects, the program manager can also perform some traditional
project manager duties, such as schedule and budget management. In larger projects, the
program manager brings a person with project management skills onto the team to per-
form this specialized function. In the latter case, the program manager is more of a project
architect. As previously mentioned, in larger BI projects, separate people hold the Program
Manager and Architect roles.

The program manager is the glue that holds the team together. It’s the primary responsibil-
ity of this person to translate the business needs into specific project features. The program
manager doesn’t work in a vacuum; rather, she must get input from the rest of the team.
During the envisioning and developing phases, input from the developer and user experi-
ence leads is particularly important. During later project phases, input from the test and
deployment leads takes center stage. The program manager communicates with the team
and reports the project’s status to the product manager. The program manager is also the
final decision maker if the team can’t reach consensus on key issues.

Here is a brief list of the program manager’s duties:

■■ Drives the development process to ship the product on time and within budget
■■ Manages product specifications and is the primary project architect
■■ Facilitates communication and negotiation within the team
■■ Maintains the project schedule, and reports project status
■■ Drives implementation of critical tradeoff decisions
■■ Develops, maintains, and executes the project master plan and schedule
■■ Drives and manages risk assessment and risk management

Development
The developer manager is either the lead BI developer (in smaller projects) or acts as a more
traditional manager to a group of BI developers. These developers usually consist of both
internal staff and external consultants. As mentioned earlier, this combination is most often
used because of the complexity and density of skills required for BI developers.

The role that the developer manager (sometimes called the dev lead) plays depends greatly
on the size of the developer team. In smaller projects, there might be only a few developers.
In those situations, the developer manager also performs development tasks. In larger teams,
the developer manager supervises the work of the team’s developers.

Here is a brief list of the developer manager’s duties:

■■ Specifies the features of physical design


■■ Estimates the time and effort to complete each feature
■■ Builds or supervises the building of features
■■ Prepares the product for deployment
■■ Provides technology subject matter expertise to the team

The most important consideration for your project team is determining the skills of your
developers. We’ve seen a trend of companies underestimating the knowledge ramp for .NET
developers to SSAS, SSIS, and SSRS competency. Although the primary tools used to work
with these services are graphical—that is, BIDS and SSMS—the ideas underlying the GUIs are
complex and often new to developers. Assuming that .NET developers can just pick up all
the skills needed to master BI—particularly those needed for modeling the underlying data
store structures of OLAP cubes and data mining models—is just not realistic. Also, the SSIS
interface (and associated package-creation capabilities) is quite powerful, and few working,
generalized .NET developers have had time to fully master this integrated development envi-
ronment. Finally, in the area of reporting, the variety of client interfaces that are often asso-
ciated with BI projects can demand that developers have a command of Excel, SSRS, Office
SharePoint Server 2007, PerformancePoint Server, or even custom client creation.

The developer manager needs to be able to do the following: assess current skills, create a
skills gap map, and assign work (training current developers or delegating tasks to other devel-
opers who have these skills). A skills gap map is a document that lists required project team
member skills and then lists current team member skills. It then summarizes gaps between
the skills the team currently has and those that the team needs. This type of document is
often used to create a pre-project team training plan. Because BI skills are often new to tra-
ditional developers, it was a common part of our practice to design and implement team
training prior to beginning work on a BI project. This training included BI concepts such as
star schema modeling, as well as product-specific training, such as a step-by-step reviewing
of how to build an OLAP cube using SSAS and more.

In our experience, every BI project needs at least three types of developers: an SSAS devel-
oper, an SSIS developer, and an SSRS developer. In smaller shops, it’s common to hire outside
contractors for some or all of these skills because that is more cost-effective than training the
current developer teams.

Who Is a BI Developer?
It’s interesting to consider what types of prerequisite skills are needed to best be able to
learn BI development. Much of the information that Microsoft releases about working
with SQL Server 2008 BI is located on its TechNet Web site (http://technet.microsoft.com).
This site is targeted at IT professionals (network and SQL administrators). We question
whether this is actually the correct target audience. The key question to consider is this:
Can an IT administrator effectively implement a BI solution?

It has been our experience that traditional developers, particularly .NET develop-
ers, have the skills and approach to do this on a faster learning curve. We think this is
because those developers bring an understanding of the BI development environment
(BIDS), from having used Visual Studio. This gives them an advantage when learning the
BI toolset. Also, we find that most of these developers have had experience with data
modeling and data query. Of course their experience is usually with relational rather
than multidimensional databases. However, we do find that teaching OLAP and data
mining database development skills is quicker when the students bring some type of
database experience to the table.

Another tendency we’ve seen is for managers to assume that because the BIDS devel-
opment environments are GUI driven (rather than code driven), nonprogrammers
can implement BI solutions. Although this might be possible, we find that developers,
rather than administrators, can gain a quicker understanding of core BI concepts and
tools. Of course, to every guideline there are exceptions. We certainly have also worked
with clients who ARE administrators, who have successfully implemented BI projects.

Note Although MSF is flexible in terms of role combinations for smaller teams, one best prac-
tice is worth noting here. MSF strongly states that the Developer and Tester roles should never
be combined. Although this might seem like common sense, it’s surprising to see how often this
basic best practice is violated in software projects. If developers were capable of finding their
own bugs, they probably wouldn’t write them in the first place. Avoid this common problem!

Test
The test manager is the lead tester in smaller projects and the test manager of a team in
larger projects. The test manager has the following responsibilities:

■■ Ensure that all issues are known


■■ Develop testing strategy and plans
■■ Conduct testing

Like all effective testing, BI testing should be written into the functional specification in the
form of acceptance criteria—that is, “Be able to perform xxx type of query to the cube and
receive a response within xxx seconds.” As mentioned previously in this chapter, effective
testing is comprehensive and includes not only testing the cubes and mining models for
compliance with the specifications, but also end-user testing (using the client tools), security
testing, and, most important of all for BI projects, performance testing using a projected
real-world load. Failing to complete these tasks can lead to cubes that work great in develop-
ment, yet fail to be usable after deployment.

User Experience
The user experience manager’s primary role is developing the user interface. As with test and
developer managers, the user experience manager performs particular tasks in smaller teams
and supervises the work of team members in larger ones.

The following list summarizes the responsibilities of the user experience manager:

■■ Acts as user advocate


■■ Manages the user requirements definition
■■ Designs and develops performance support systems
■■ Drives usability and user performance-enhancement tradeoff decisions
■■ Provides specifications for help features and files
■■ Develops and provides user training

The user experience manager needs to be aware of the skills of the designers working on the
user interface. Many BI projects suffer from ineffective and unintuitive user interfaces for the
data stores (OLAP cubes and mining models). The problem of ineffective visualization of BI
results for users often hinders broader adoption. With the massive amounts of data available
in OLAP cubes, old metaphors for reporting, such as tables and charts, aren’t enough and
important details, such as the business taxonomy, are often left out. We’ve actually seen an
OLAP cube accessed by an Excel PivotTable that included a legend to translate the field views
into plain English.

A successful user interface for any BI project must have the following characteristics:

■■ As simple as is possible, including the appropriate level of detail for each audience
group
■■ Visually appealing
■■ Written in the language of the user
■■ Linked to appropriate levels of detail
■■ Can be manipulated by the users (for example, through ad hoc queries or pivot tables)
■■ Visually intuitive, particularly when related to data mining

In addition to trying to increase usability, you should include an element of fun in the inter-
face. Unfortunately, all too often we see boring, overly detailed and rigid interfaces. Rather
than these interfaces being created by designers, they’re created by developers, who are usu-
ally much more skilled in implementing business logic than in creating a beautiful UI design.
In Part IV, which is devoted to UI (reporting), we’ll include many examples that are effective.

Note Microsoft Research contains an entire group, the data visualization group, whose main
function is to discover, create, and incorporate advanced visualizations into Microsoft products.
This group has made significant contributions to SQL Server 2008 (as well as to previous versions
of SQL Server), for business intelligence in particular. For example, they created some of the core
algorithms for data mining as well as some of the visualizers included for data mining models in
BIDS. Go to http://research.microsoft.com/vibe/ to view some of their visualization tools.

Release Management
The release/operations manager is responsible for a smooth deployment from the develop-
ment environment into production and has the following responsibilities:

■■ Acts as an advocate for operations, support, and delivery channels


■■ Manages procurement
■■ Manages product deployment
■■ Drives manageability and supportability tradeoff decisions (also known as
compromises)
■■ Manages operations, support, and delivery channel relationship
■■ Provides logistical support to the project team

Release/operations managers must be willing to learn specifics related to cube and min-
ing model initial deployment, security, and, most important, maintenance. In our consulting
work, we find that the maintenance phase of BI projects is often overlooked. Here are some
questions whose answers will affect the maintenance strategy:

■■ What is the projected first-year size of the cubes and mining models?
■■ What is the archiving strategy?
■■ What is the physical deployment strategy?
■■ What is the security auditing strategy?

Summary
In this chapter, we looked at two areas that we’ve seen trip up more than one well-
intentioned BI project: managing the software development life cycle and building the BI
team. We provided you with some guidance based on our experience in dealing with the
complexity of BI project implementation. In doing that, we reviewed the software develop-
ment life cycle that we use as a basis for our projects—MSF. After taking a look at generic
MSF, we next applied MSF to BI projects and mentioned techniques and tips particular to the
BI project space. We then looked at MSF’s guidance regarding team building. After reviewing
the basic guidance, which includes roles and responsibilities, we again applied this informa-
tion to BI projects.

In Chapter 4, "Physical Architecture in Business Intelligence Solutions," we turn our attention
to the modeling processes associated with BI projects. We look in detail at architectural con-
siderations regarding physical modeling. The discussion includes describing the physical servers
and logical servers needed to begin planning, prototyping, and building your BI project. In
Chapter 4, we'll also detail best practices for setting up your development environment,
including a brief discussion of the security of data.
Chapter 4
Physical Architecture in Business
Intelligence Solutions
In this chapter, we turn to the nitty-gritty details of preparing the physical environment for
developing your business intelligence (BI) project. We cover both physical server and ser-
vice installation recommendations. Our goal is to help your team get ready to develop your
project. We also look at setup considerations for test and production environments. Then we
include a discussion of the critical topic of appropriate security for these environments. So
often in our real-world experience, we’ve seen inadequate attention given to many of these
topics (especially security), particularly in the early phases of BI project implementations. We
conclude the chapter with a discussion of best practices regarding source control in a team
development environment.

Planning for Physical Infrastructure Change


As you and your team move forward with your BI project, you’ll need to consider the alloca-
tion and placement of key physical servers and the installation of services for your devel-
opment, test, and production environments. Although you can install all the components
of Microsoft SQL Server 2008 BI onto a single physical server, other than for evaluation or
development purposes, this is rarely done in production environments. The first step in the
installation process is for you to conduct a comprehensive survey of your existing network
environment. You need to do this so that you can implement change in your environment in
a way that is planned and predictable and that will be successful. Also, having this complete
survey in hand facilitates rollback if that need arises during your BI project cycle.

Creating Accurate Baseline Surveys


To get started, you need to conduct a comprehensive survey of the existing environment.
If you’re lucky, this information is already available. In our experience, however, most exist-
ing information is incomplete, inaccurate, or missing. You can use any convenient method of
documenting your findings. We typically use Microsoft Office Excel to do this. This activity
should include gathering detailed information about the following topics:

■■ Physical servers name, actual locations, IP addresses of all network interface cards
(NICs), domain membership Servers documented should include authentication
servers, such as domain controllers; Web servers; and data source servers, such as file
servers (for file-based data sources, such as Excel and Microsoft Access) and RDBMS
servers. If your BI solution must be accessible outside of your local network, include
perimeter devices, such as proxy servers or firewalls in your documentation. You should
also include documentation of open ports.
■■ Operating system configuration of each physical server operating system version,
service packs installed, administrator logon credentials, core operating system ser-
vices installed (such as IIS), and effective Group Policy object (GPO) settings If your
solution includes SQL Server Reporting Services (SSRS), take particular care in docu-
menting Internet Information Services (IIS) configuration settings.

Note IIS 6.0 (included in Windows Server 2003) and IIS 7.0 (included in Windows Vista
and Windows Server 2008) have substantial feature differences. These differences as they
relate to SSRS installations will be discussed in much greater detail in Chapter 20, “Creating
Reports in SQL Server 2008 Reporting Services.” For a good general reference on IIS, go to
http://www.iis.net.

■■ Logical servers and services installed on each physical server This includes the
name of each service (such as SQL Server, Analysis Services, and so on) and the version
and service packs installed for each service. It should also include logon credentials
and configuration settings for each service (such as collation settings for SQL Server).
You should also include the management tools installed as part of the service instal-
lation. Examples of these are Business Intelligence Development Studio (BIDS), SQL
Server Management Studio (SSMS), SQL Profiler, and so on for SQL Server. Services
documented should include SQL Server, SQL Server Analysis Services (SSAS), SQL Server
Integration Services (SSIS), and SSRS. (A query sketch for capturing some of these engine-level
details appears after this list.)
■■ Development tools This includes the names, versions, and configuration information
for all the installed development tools, such as Microsoft Visual Studio. Visual Studio is
not strictly required for BI development; however, in many cases you’ll find its features,
such as Microsoft IntelliSense for .NET code, useful for advanced BI development.
■■ Samples and optional downloads As mentioned previously in this book, all samples
for SQL Server 2008 are now available from http://www.codeplex.com. Samples are
not part of the installation DVD. We do not recommend installing samples in a pro-
duction environment; however, we do generally install the samples in development
environments. We also find a large number of useful tools and utilities on CodePlex.
Documenting the use of such tools facilitates continuity in a team development envi-
ronment. An example of these types of tools is the MDX Script Performance Analyzer,
found at http://www.codeplex.com/mdxscriptperf.
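
For the database engine portion of the service inventory described above, much of the
version and configuration detail can be captured with a simple query. This is only a sketch;
run it against each installed instance, and record the equivalent details for SSAS, SSIS, and
SSRS from their own configuration tools:

-- Capture engine details for the baseline survey worksheet.
SELECT
    SERVERPROPERTY('MachineName')     AS machine_name,
    SERVERPROPERTY('InstanceName')    AS instance_name,    -- NULL for the default instance
    SERVERPROPERTY('Edition')         AS edition,
    SERVERPROPERTY('ProductVersion')  AS product_version,
    SERVERPROPERTY('ProductLevel')    AS service_pack_level,
    SERVERPROPERTY('Collation')       AS server_collation;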

After you’ve gathered this information, you should plan to store the information in both elec-
tronic and paper versions. Storing the information in this way not only aids you in planning
for change to your environment but is a best practice for disaster-recovery preparedness. The
information in the preceding list is the minimum amount you’ll need. In some environments,
you’ll also need to include information from various log files, such as the Windows Events
Viewers, IIS logs, and other custom logs (which often include security logs).

Assessing Current Service Level Agreements


Service level agreements (SLAs) are being increasingly used to provide higher levels of pre-
dictability in IT infrastructures. If your organization already uses SLAs, reviewing the stan-
dards written into them should be part of your baseline survey. Your goal, of course, is to
improve report (query) performance by introducing BI solutions into your environment.

If your company does not use SLAs, consider attempting to include the creation of a
BI-specific SLA in your BI project. An important reason for doing this is to create a basis for
project acceptance early in your project. This also creates a high-level set of test criteria. A
simple example is to assess appropriate query responsiveness time and to include that met-
ric. For example, you can state the criteria in phrases like the following: “Under normal load
conditions (no more than 1000 concurrent connections), query response time will be no more
than 3 seconds.”

Another common component of SLAs is availability, which is usually expressed in terms of
the nines (or "9's"). For example, an uptime target of "five nines" (99.999 percent) allows
roughly 5 minutes of unplanned downtime per year. By comparison, "four nines" (99.99 percent)
allows roughly 53 minutes of unplanned downtime per year. We'll talk a bit more about
availability strategies later in this chapter.
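
Assuming a standard 8,760-hour year, the arithmetic behind these figures is simple; the
following sketch (illustrative values only) computes the unplanned downtime implied by an
availability percentage:

-- Downtime per year implied by an availability target.
DECLARE @availability decimal(8,5) = 99.999;   -- "five nines"
SELECT
    @availability                                  AS availability_percent,
    (1 - @availability / 100.0) * 365 * 24 * 60    AS downtime_minutes_per_year;
-- 99.999 yields roughly 5.3 minutes per year; 99.99 yields roughly 52.6 minutes.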

Note For an example of a simple, downloadable (for a fee) SLA, see http://www.sla-world.com/
sladownload.htm.

What if you do not have or plan to use SLAs? You can and should still assess your current
operating environments to create meaningful baseline information. This assessment should,
at a minimum, consist of a simple list of pain points regarding data and reporting. You’ll use
this information at the beginning of your project for inclusion in the problem statement por-
tion of your specification documents. Alleviating or at least significantly reducing this pain is,
of course, the ultimate goal of your BI project. We recommend using the Windows Reliability
and Performance Monitor tool to capture current conditions. After you’ve installed SSAS, if
you want to collect information about SSAS itself, several groups of counters are specific to
it. These counter names all have the format of MSAS 2008:<CounterName>. After you install
SSIS or SSRS, additional performance counters specific to those services also become avail-
able. The default view of this tool in Windows Server 2008 is shown in Figure 4-1.

Figure 4-1 Windows Server 2008 Reliability and Performance Monitor tool

Considerations here can include the following possible problems:

■■ Slow report rendering as a result of current, unoptimized OLTP sources.


■■ Slow query execution as a result of unoptimized or underoptimized OLTP queries.
■■ Excessive overhead on core OLTP services as a result of contention between OLTP and
reporting activities running concurrently. This overhead can be related to CPU, mem-
ory, disk access contention, or any combination of these.
■■ Short OLTP maintenance windows because of long-running backup jobs and other
maintenance tasks (such as index defragmentation).
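
For the query-contention and maintenance-window items in the preceding list, one useful
piece of baseline evidence is current index fragmentation on the OLTP source. The database
name here is hypothetical, and this is only a sketch of the kind of measurement worth
recording as part of the baseline:

-- Record current index fragmentation in the OLTP source as part of the baseline.
SELECT
    OBJECT_NAME(ips.object_id, ips.database_id)  AS table_name,
    ips.index_id,
    ips.avg_fragmentation_in_percent,
    ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(N'SalesOLTP'), NULL, NULL, NULL, 'LIMITED') AS ips
WHERE ips.page_count > 1000      -- ignore small indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;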

After you’ve completed your baseline environment survey, your next preparatory step is to
consider the number of physical servers you’ll need to set up your initial development envi-
ronment. There are two ways to approach this step. The first is to simply plan to set up a
working development environment with the intention of creating or upgrading the produc-
tion environment later. The second is to plan and then build out all working environments,
which could include development, test, and production.

The approach that is taken usually depends on the size and maturity of the BI project. For
newer, smaller projects, we most often see companies choose to purchase one or two new
servers for the development environment with the intention of upgrading or building out the
production environment as the project progresses.

Determining the Optimal Number and Placement of Servers
When you’re planning for new servers, you have many things to consider. These can include
the following:

■■ Source systems If your source systems are RDBMSs, such as an earlier version of SQL
Server or some other vendor’s RDBMS (Informix, Sybase, and so on), you might want
to upgrade source data systems to SQL Server 2008 to take advantage of BI-related
and non-BI-related enhancements. For more information about enhancements
to the relational engine of SQL Server 2008, go to http://download.microsoft.com/
download/C/8/4/C8470F54-D6D2-423D-8E5B-95CA4A90149A/SQLServer2008_OLTP_
Datasheet.pdf.
■■ Analysis Services Where will you install this core service for BI? We find that for
smaller BI projects, some customers elect to run both SQL Server 2008 and Analysis
Services on the same physical machine. There, they run both OLAP and extract, trans-
form, and load (ETL) processes. This configuration is recommended only for smaller
implementations—fewer than 100 end users and less than 5 GB of data in an OLAP
cube. However, when creating a development environment, it is common to install both
SQL Server and Analysis Services on the same machine. Another common way to set up
a development environment is to use multiple virtual machines. The latter is done so
that the security complexities of multiple physical server machine installs can be mir-
rored in the development or test environments.
■■ Integration Services Where will you install this service? As mentioned, for develop-
ment, the most common configuration we see is a SQL Server instance, primarily for
hosting SSIS packages, installed on the same machine as SSAS. In production, the most
common configuration is to install the SSIS component for the SQL Server instance on a
separate physical machine. This is done to reduce contention (and load) on the produc-
tion SSAS service machine.
■■ Reporting Services Where will you install this service? As with SSIS, in development,
we usually see an SSRS instance installed on the same machine as SSAS and SSIS. Some
customers elect to separate SSAS or SSIS from SSRS in the development phase of their
BI project because, as mentioned earlier in this section, this type of configuration can
more closely mirror their typical production environment. Production installations vary
greatly depending on the number of end users, cube size, and other factors. In produc-
tion, at a minimum, we see SSRS installed on at least one dedicated server.

Tip Like SQL Server itself, SSAS installations support multiple instances on the same server. This
is usually not done in development environments, with one important exception. If you’re us-
ing an earlier version of SSAS, it’s common to install multiple instances in development so that
upgradeability can be tested.

More often, we see a typical starting BI physical server installation consisting of at least two
new physical servers. SSAS and SQL Server are installed on the first server. (SSIS is run on
this server.) SSRS is installed on the second server. We consider this type of physical configu-
ration to be a starting setup for most customers. Many of our customers elect to purchase
additional servers for the reasons listed in the component list shown earlier—that is, to scale
out SSRS, and so on. We also choose to add physical servers in situations where hard security
boundaries, differing server configuration settings, or both are project requirements. Figure
4-2 illustrates this installation concept. It also shows specifically which logical objects are part
of each SSAS instance installation. We’ll talk more about these logical objects in Chapter 5,
“Logical OLAP Design Concepts for Architects.”

Figure 4-2 Multiple SSAS instances on a single physical server (illustration from SQL Server Books Online). Each instance hosts a server object containing one or more database objects, which in turn contain OLAP objects, data mining objects, and helper objects such as data sources and data source views; the original diagram also shows an application connecting to an instance through AMO.

Note An important consideration when you’re working with existing (that is, previous ver-
sions, such as SQL Server 2000 or earlier) SQL Server BI is how you’ll perform your upgrade of
production servers for SSAS, SSIS, and SSRS. For more information about upgrading, see the
main Microsoft SQL Server Web site at http://www.microsoft.com/sql, and look in the “Business
Intelligence” section.

For large BI projects, all components—that is, SSAS, SSIS, and SSRS—can each be scaled to
multiple machines. It’s more common to scale SSAS and SSRS than ETL to multiple machines.
We’ll cover this in more detail in Chapter 9, “Processing Cubes and Dimensions,“ for SSAS and
in Chapter 22, “Advanced SQL Server 2008 Reporting Services,” for SSRS. For SSIS scaling,
rather than installing SSIS on multiple physical machines, you can use specific SSIS package
design techniques to scale out execution. For more information, see Chapter 16, “Advanced
Features in Microsoft SQL Server 2008 Integration Services.”

Tip Because of the large number of considerations you have when planning a BI project, some
hardware vendors (for example, Dell) have begun to provide guidance regarding proper server
sizing for your BI projects. For more information, go to the Dell Web site at http://www.dell.com.

Considerations for Physical Servers


In most environments, you’ll want to use, at a minimum, one uniquely dedicated physical
server for your BI project. As mentioned, most of our clients start by using this server as a
development server. Your goal in creating your particular development environment depends
on whether or not you’ll also be implementing a test environment. If you are not, you’ll want
to mirror the intended production environment (as much of it as you know at this point in
your project) so that the development server can function as both the development server
and test server.

The number and configuration of physical servers for your production environment will vary
greatly depending on factors like these: the amount of data to be processed and stored, the
size and frequency of updates, and the complexity of queries. A more in-depth discussion of
scalability can be found in Chapter 9.

Note If you read in SQL Server Books Online that “BIDS is designed to run on 32-bit servers
only,” you might wonder whether you can run BIDS on an x64 system. The answer is yes, and
the reason is that it will run as a WOW64 (Windows 32-bit on Windows 64-bit) application on
x64 hardware.

Other considerations include the possibility of using servers that have multiple 64-bit pro-
cessors and highly optimized storage—that is, storage area networks (SANs). Also, you can
choose to use virtualization technologies, such as Virtual Server or Hyper-V, to replicate
the eventual intended production domain on one or more physical servers. Figure 4-3
shows a conceptual view of virtualization. For more information about Windows Server
2008 virtualization options, see http://www.microsoft.com/windowsserver2008/en/us/
virtualization-ent.aspx.

Figure 4-3 Using virtualization technologies can simplify BI development environments. (The diagram shows several physical servers consolidated as virtual guests on a single virtual host.)

Server Consolidation
When you are setting up the physical development environment for your BI project, you
don’t have to be too concerned with server consolidation and virtualization. However, this
will be quite different as you start to build your production environment. Effective virtual-
ization can reduce total cost of ownership (TCO) significantly. There are a few core consid-
erations for you to think about at this point in your project so that you can make the best
decisions about consolidation when the time comes:

■■ Know baseline utilization (particularly of CPUs).


■■ Consider that tempdb will be managed in a shared environment, which involves impor-
tant limits such as the use of a single collation and sort order (see the collation check
after this list).
■■ Remember that server-level and msdb settings, such as logons that collide between
servers, must be reconciled prior to implementing consolidation.
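
As a quick check for the collation and sort-order point above, a short query run on each
instance being considered for consolidation surfaces conflicts early (a sketch only; instance
and database names will vary):

-- Compare the server collation and each database collation before consolidating instances.
SELECT SERVERPROPERTY('Collation') AS server_collation;

SELECT name AS database_name,
       collation_name
FROM sys.databases
ORDER BY collation_name, name;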

Considerations for Logical Servers and Services


BI projects contain many parts. At a minimum, you’ll probably work with SSAS, SSIS, and
SSRS. Given that, you need to consider on which physical, virtual, or logical server you’ll need
to install each instance of the aforementioned services. You’ll also need to install all prerequi-
sites on the operating system as well as development and management tools, such as SSMS
and BIDS.

As shown in Figure 2-1 in Chapter 2, “Visualizing Business Intelligence Results,” you can have
quite a large number of items to install for a BI solution. In addition to the core services—
SSAS, SSIS, and SSRS—you’ll install tools such as BIDS, SSMS, and SQL Profiler, as well as other
services, such as Microsoft Office SharePoint Server 2007, Dynamics, or PerformancePoint
Server.

Each of these services contains many configuration options. It is important to understand
that the BI stack, like all of SQL Server 2008, installs with a bare-minimum configuration.
This is done to facilitate the use of security best practices. In particular, it’s done to support
the idea of presenting a reduced attack surface. If your solution requires more than minimal
installation—and it probably will—it’s important that you document the specific installation
options when you set up the development environment. You can access these setup options
in a couple of ways, the most common of which is to connect to the service via SSMS and
then to right-click on the instance name to see its properties. Figure 4-4 shows just some of
the service configuration options available for SSAS. Unless you have a specific business rea-
son to do so, you most often will not make any changes to these default settings. You’ll note
that the Show Advanced (All) Properties check box is not selected in Figure 4-4. Selecting this
check box exposes many more configurable properties as well.

Figure 4-4 SSAS server configuration settings in SSMS

One example of this configuration is to document the service logon accounts for each of the
involved services—that is, for SSAS, SSIS, and SSRS. Of course, all these services connect to
one or more data sources, and those data sources each require specific logon credentials.
We’ve found over and over in BI projects that failure to adequately document informa-
tion such as this is a big time waster over the lifetime of the project. We’ve spent many
(expensive, billable!) hours chasing down connection account credentials.

You’ll definitely want to use the SQL Server Configuration Manager to verify all service set-
tings. It’s particularly useful as a quick way to verify all service logon accounts. Figure 4-5
shows this tool.

Figure 4-5 SQL Server Configuration Manager

Other considerations include what your security baselines should be when installing new
services. Determining this might require coordination with your security team to plan for
changes to group policies to restrict usage of various tools available with BI services. An
example of this is the use of SQL Profiler, which allows traces to be created for SSAS activity.
Also, you need to decide how to implement security based on the type of client environment
your development environment should contain. As mentioned, most BI projects will include
some sort of SSRS implementation. However, some projects will include other products, such
as PerformancePoint Server and Microsoft Office SharePoint Server 2007.

Unless we specifically say otherwise, references to Office SharePoint Server 2007 throughout
this book are to the full product, which requires both server and client access licenses. Office
SharePoint Server 2007 includes many
rich features commonly used in BI projects, such as the Report Center site template. Your
project might also include Windows SharePoint Services, a free download for Windows
Server 2003 or later. Of course, the more complex the development (and eventual produc-
tion) environments, the more important detailed documentation of service configuration
becomes.

Other possible client installation pieces are Office 2007 and the Data Mining Add-ins for SQL
Server 2008. There are also many third-party (commercial) custom client interface tools avail-
able for purchase.

Understanding Security Requirements


As with scalability, we’ll get to a more comprehensive discussion of security later in this
book. We include an introduction to this topic here because we’ve noted that some custom-
ers elect to use a copy (or a subset) of production data in development environments. If this
is your situation, it’s particularly critical that you establish appropriate security for this data
from the start. Even if you use masked or fake data, implementing a development environ-
ment with least-privilege security facilitates an appropriate transition to a secure production
environment.

Note You can use a free tool to help you plan security requirements. It’s called the Microsoft
Security Assessment Tool, and it’s located at https://www.microsoft.com/technet/security/tools/
msat/default.mspx.

Several new features and tools are available with SQL Server 2008 (OLTP) security, such as the
new Declarative Management Framework (DMF) policies and the Microsoft Baseline Security
Analyzer (which can be downloaded from https://www.microsoft.com/technet/security/tools/
mbsahome.mspx). Most of these tools are not built to work with SSAS, SSIS, and SSRS security.
For this reason, you should be extra diligent in planning, implementing, and monitoring a
secure BI environment. We stress this because we’ve seen, more often than not, incomplete
or wholly missing security in this environment. As problems such as identity theft grow, this
type of security laxness will not be acceptable.

Security Requirements for BI Solutions


When you’re considering security requirements for your BI solution, following the basic best
practice of (appropriate) security across all tiers of the solution is particularly important. BI
projects are, after all, about enterprise data. They also involve multiple tiers, consisting of
servers, services, and more. Figure 4-6 gives you a high-level architectural view of the BI
solution landscape.

(Figure 4-6 depicts the solution tiers: Win32, COM-based, and .NET client applications for OLAP and/or data mining connect through OLE DB for OLAP, ADO MD, or ADO MD.NET, and communicate with an instance of SQL Server 2008 Analysis Services using XMLA over TCP/IP, or using XMLA over HTTP through the IIS data pump for WAN access; Integration Services packages feed the instance from relational databases and other data sources.)

Figure 4-6 In planning BI solution security, it’s important to consider security at each tier.

Source Data: Access Using Least-Privileged Accounts


As discussed, you should have collected connection information for all source data as part
of your baseline survey activities. Connect to source data using the security best practice of
least privilege—that is, use nonadministrator accounts to connect to all types of source data.
This includes both file-based and relational source data. We’ve found that in some cases,
we’ve actually had to have reduced-privilege accounts created in the source systems specifi-
cally for BI component connectivity.

Note If you’re using SQL Server 2008 as one of your source data repositories, you can take
advantage of many new security enhancements available in the OLTP source layer. A few
examples include the following: declarative policy management, auditing, and transparent data
encryption. For more information, see http://download.microsoft.com/download/d/1/b/d1b8d9aa-
19e3-4dab-af9c-05c9ccedd85c/SQL%20Server%202008%20Security%20-%20Datasheet.pdf.
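As a flavor of one of these enhancements, the following is a minimal sketch of enabling transparent data encryption on a source database running an edition that supports the feature. The database name, certificate name, and password are placeholders; follow your organization’s key-management practices before doing this for real.

USE master;
GO
-- Create a database master key and a certificate to protect the database encryption key.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';   -- placeholder
CREATE CERTIFICATE TDE_BI_Cert WITH SUBJECT = 'TDE certificate for BI source data';
GO
USE SourceSalesDB;   -- placeholder name for the OLTP source database
GO
CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_128
    ENCRYPTION BY SERVER CERTIFICATE TDE_BI_Cert;
GO
ALTER DATABASE SourceSalesDB SET ENCRYPTION ON;
GO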

Data in Transit: Security on the Wire


You might be tempted to skip the key security consideration step of protecting data on
the wire, or in transit, between systems. This would be an unfortunate oversight. Security
breaches happen, on average, at a rate of three-to-one inside the firewall versus outside of it.
It’s naïve to believe that such issues will not arise in your environment. Sensitive or personal
data—such as credit card information, Social Security numbers, and so on—must be pro-
tected in transit as well as while stored. We most commonly see either network-level (such as
IPSec) or application-level (HTTPS) security performing this function.

Processing Layer: ETL


The next phase in your security planning process is to consider security in what we call the
processing layer—that is, during SSIS package implementation and execution. Keep in mind
that SSIS packages are executable files, which are designed, of course, with a specific pur-
pose. That purpose (when building out a BI solution) is to extract, transform, and load data
in preparation for building SSAS cubes and mining models. We do see customers using SSIS
for non-BI purposes, such as data migration, but we won’t cover those other types of cases in
this discussion.

Security in SSIS is layered, starting with the packages themselves. Packages include several
security options. These are related to the ability to read, write, and execute the packages.
Using SSMS, you can view (or change) the roles associated with read/write ability for each
package. Figure 4-7 shows the dialog box that enables you to do this.

Figure 4-7 SSIS packages have associated read and write permissions.

When you create SSIS packages in BIDS, each package has some default security settings
assigned to sensitive package properties, such as connection strings. You can change these
default encryption settings as your security requirements necessitate. For example, these set-
tings include the ProtectionLevel property value, and (optionally) assigning a password to the
package. The default setting for the ProtectionLevel option is EncryptSensitiveWithUserKey.
All possible settings for this option are shown in Figure 4-8.

Figure 4-8 ProtectionLevel SSIS package option settings

It’s important that you set and follow a regular security standard for SSIS packages at the
time development begins for your project. We’ve really just scratched the surface of SSIS
security. An example of another security option available for SSIS packages is verification
of integrity via association with digital signatures. Yet another security consideration is how
you will choose to store connection string information used by your packages. Some options
for doing this include storing connection string information within the package or storing it
externally (such as in a text file, XML file, and so on). We’ll cover this topic and other advanced
SSIS security topics more thoroughly in Chapter 18, “Deploying and Managing Solutions in
Microsoft SQL Server 2008 Integration Services,” which covers SSIS in depth.

One final thought before we end this introduction to SSIS security is related to where you
choose to store your SSIS packages. Packages can be stored in SQL Server itself or in a folder
on the file system. We’ll cover the exact locations in more detail in just a bit. For security pur-
poses—particularly if you choose to store your packages in folders—you should remember to
appropriately apply an access control list (ACL) to those folders.

SSAS Data
After the data is associated with SSAS, you have a new set of security considerations in this
key service for BI. The first of these is the permissions associated with the logon account for
the SSAS service itself. As with any service, the SSAS logon account should be configured
using the security principle of least privilege.

The next consideration for security is at the level of a BIDS solution. As in Visual Studio, top-
level containers are called solutions and represented as *.sln files. Solutions contain one or
more projects. Inside of each BIDS solution, you’ll set up one or more data source objects.
Each of these objects represents a connection string to some type of data that will be used
as a source for your SSAS objects. Each data source requires specific connection credentials.
As with any type of connection string, you supply these credentials via the dialog box pro-
vided. In addition to supplying these credentials, you must also supply the impersonation
information for each data source object. Using the SSAS service account is the default option.
You can also choose to use a specific Microsoft Windows account, to use the credentials of
the current user, or to inherit credentials. This step is shown in Figure 4-9.

Figure 4-9 Impersonation Information options when connecting to an SSAS 2008 cube

So which one of these options should you use? At this point, we simply want to present you
with the options available. As we drill deeper into working with BIDS, we’ll arm you with
enough information to answer this question in the way that best supports your particular
project requirements.

Here’s a bit more information about the impersonation option as well. It might help you to
understand the answer to the following question: “When does impersonation get used rather
than the credentials supplied in the connection string?” SQL Server Books Online includes this
guidance:

The specified credentials will be used for (OLAP cubes) processing, ROLAP queries,
out-of-line bindings, local cubes, mining models, remote partitions, linked objects,
and synchronization from target to source. For DMX OPENQUERY statements,
however, this option is ignored and the credentials of the current user will be used
rather than the specified user account.

You’ll have to plan for and implement many other security considerations for SSAS as your
project progresses. As with the SSIS processes discussed earlier, our aim is to get you started
with baseline security. We’ll go into detail about other considerations, including SSAS roles
and object-specific permissions, in Part II, “Microsoft SQL Server 2008 Analysis Services for
Developers,” which focuses on SSAS implementation.

On the User Client: Excel or SSRS


The last consideration in your core security architecture for your BI development environ-
ment is how thoroughly you want to build your client environment. It has been our experi-
ence that the complete client environment is often fully known at this phase of BI projects.
Usually, the environment includes, at a minimum, SSRS and Excel. Other clients that we’ve
seen included at this point are Office SharePoint Server 2007 and PerformancePoint Server.

In this section, we’ll focus first on security in Excel 2007 for SSAS objects—focusing on OLAP
cubes and then data mining structures. After we briefly discuss best practices on that topic,
we’ll take a look at core security for SSRS. A more complete discussion of SSRS security is
included in Part IV, “Microsoft SQL Server Reporting Services and Other Client Interfaces for
Business Intelligence,” which is devoted entirely to implementation details of SSRS.

Excel as an OLAP Cube Client As we saw in previous chapters, connecting Excel 2007 as
a client of SSAS 2008 OLAP cubes via Excel’s PivotTable view is a straightforward operation.
This is, of course, by design. What is the best practice for baseline security in this situation?
Note that in Figure 4-10 you can see the default configuration, which is to use the currently
logged-on Windows user account to authenticate to the OLAP cube data source. Note also
that the connection string is stored as a file on the client. As a baseline measure, you should
use the principle of least privilege, which means that, at the very least, a non–administrator
account should be used to connect. Also, if you’re passing a specific user name and password
into the connection string, be aware that the credential information is stored in plain text in
the local connection string file.

Figure 4-10 Connection properties in Excel 2007 when connecting to an SSAS 2008 cube

You might also want to enable auditing using one of the available auditing tools, such as SQL
Profiler, if you’re granting access via Excel to OLAP cubes that are built using any production
data.

Excel as a Data Mining Client If you plan to use Excel as a client for SSAS data mining
structures, you must first install the SQL Server 2008 Data Mining Add-ins for Office 2007.
After you install the add-ins, you must run the included Server Configuration Utility to set up
the baseline configuration between Excel 2007 and SSAS 2008 data mining. This consists of
the following steps:

1. Specify the SSAS server name. (The connecting user must have permission on SSAS to
create session mining models.)
2. Specify whether you want to allow the creation of temporary mining models.
3. Specify whether you’d like to create a new database to hold information about users of
your mining models.
4. Specify whether to give users other than administrators permission to create perma-
nent mining models in SSAS by using the Excel interface.

You can rerun this wizard to change these initial settings if you need to.

After you complete the initial configuration, use the Connection group on the Data Mining
tab of the Excel 2007 Ribbon to configure session-specific connection information. Clicking
the connection button starts a series of dialog boxes identical to the ones you use to config-
ure a connection to an SSAS cube for an Excel PivotTable view. Figure 4-11 shows the Data
Mining tab, with the connection button showing that the session has an active connection to
an SSAS server (the AdventureWorksDW2008 database).

Figure 4-11 Connection button in Excel 2007 showing an active connection

The new trace utility called Tracer, which is included with the Data Mining Add-ins, allows you
to see the connection string information. It also allows you to see the generated DMX code
when you work with SSAS data mining structures using Excel. The connection information in
Tracer is shown in Figure 4-12.

Figure 4-12 Connection information shown in the Tracer utility in Excel 2007

SSRS as an OLAP Client Installing SSRS requires you to make several security decisions.
These include components to install, install locations for services, and service account names.
As with the other BI components, we’ll cover SSRS installation in greater detail in Part II.
Another installation complexity involves whether or not you plan to integrate an Office
SharePoint Server 2007 installation with SSRS. Finally, you must choose the environment that
will eventually run the rendered reports, whether it be a browser, Windows Forms, and so on.
We’ll also present different options for security settings.

By default, SSRS is configured to use Windows authentication. You can change this to a cus-
tom authentication module—such as forms or single sign-on (SSO)—if your security require-
ments necessitate such a change. Choosing an authentication mode other than Windows
authentication requires you to go through additional configuration steps during installation
and setup.

A good way to understand the decisions you’ll need to make during SSRS setup is to review
the redesigned Reporting Services Configuration Manager. Here you can set or change the
service account associated with SSRS, the Web service URL, the metadata databases, the
report manager URL, the e-mail settings, an (optional) execution account (used when con-
necting to data sources that don’t require credentials or for file servers that host images used
by reports), and the encryption keys. You also have the option to configure scale-out deployment
locations. Another important consideration is whether you’ll use Secure Sockets Layer (SSL) to
encrypt all traffic. For Internet-facing SSRS servers, this option is used frequently. You associ-
ate the SSL certificate with the SSRS site using the Reporting Services Configuration Manager
tool as well. All SSRS information generated by this tool is stored in a number of *.config files.
We’ll look at the underlying files in detail in Part IV, which is devoted to SSRS. Figure 4-13
shows this tool.

Figure 4-13 Reporting Services Configuration Manager

After you’ve installed SSRS, as you did with SSAS, you must then determine the connection
context (or authorization strategy) for reports. Figure 4-14 shows a high-level description of
the authentication credential flow. In SSRS, roles are used to associate report-specific permis-
sions with authorized users. You can choose to use built-in roles, or you can create custom
roles for this purpose.

(Figure 4-14 depicts three connections: the user account connects to the report server (connection 1), the report server service accounts connect to the report server database (connection 2), and the user account or other stored accounts connect to external data (connection 3).)

Figure 4-14 Authentication flow for SSRS

SSRS permissions are object-specific. That is, different types of permissions are available
for different objects associated with an SSRS server. For example, permissions associ-
ated with data sources, reports, report definitions, and so on differ by object. As a general
best practice, we recommend creating several parking folders on the SSRS service. These
folders should have restricted permissions. Locate all newly deployed objects (connections,
reports, and report models) there. Figure 4-15 shows a conceptual rendering of the SSRS
authorization scheme.

(Figure 4-15 depicts groups and users, such as an administrators group or an individual user, assigned to roles such as Publisher; each role bundles tasks and permissions, such as managing reports and managing folders, and those permissions are then applied to items in the report server database, such as folders, reports, and resources.)

Figure 4-15 Conceptual rendering of SSRS security (illustration from SQL Server Books Online)

Security Considerations for Custom Clients


In addition to implementing client systems that “just work” with SSAS objects as data
sources—that is, Excel, SSRS, PerformancePoint Server, and so on—you might also elect to
do custom client development. By “just work” we mean client systems that can connect to
and can display data from SSAS OLAP cubes and data mining models after you configure the
connection string—that is, no custom development is required for their use.

The most common type of custom client we’ve seen implemented is one that is Web based.
Figure 4-16 illustrates the architecture available for custom programming of thin-client BI
interfaces. As with other types of Web sites, if you’re taking this approach, you need to con-
sider the type of authentication and authorization system your application will use and, sub-
sequently, how you will flow the credentials across the various tiers of the system.

(Figure 4-16 depicts browsers and other thin clients connecting over the Web to Internet Information Services (IIS), where ASP and ASP.NET applications use the Win32, COM-based, and .NET client libraries for OLAP and data mining (OLE DB for OLAP, ADO MD, and ADO MD.NET), which in turn communicate with an instance of SQL Server 2008 Analysis Services using XMLA over TCP/IP.)

Figure 4-16 Thin clients for BI require additional security.



It’s important that you include security planning for authentication and authorization, even at
the early stages of your project. This is because you’ll often have to involve IT team members
outside of the core BI team if you’re implementing custom client solutions. We’ve seen situ-
ations where solutions were literally not deployable into existing production environments
because this early planning and collaboration failed to occur.

As you would expect, if you’re using both OLAP cubes and data mining structures, you have
two disparate APIs to contend with. We have generally found that custom client develop-
ment occurs later in the BI project life cycle. For this reason, we’ll cover more implementation
details regarding this topic in Part IV, which presents the details about client interfaces for
SSAS.

Backup and Restore


You might be surprised to see this topic addressed so early in the book. We’ve chosen to
place it here to help you avoid the unpleasant surprises that we saw occur with a couple of
our customers. These scenarios revolve around backups that either aren’t running regularly,
backups that aren’t run at all, those that are incomplete, and, worst of all, those that are unre-
storable. The time to begin regular backup routines for your BI project is immediately after
your development environment is set up.

As with the security solution setup for BI, backup scenarios require that you consider backup
across all the tiers you’ve chosen to install in your development environment. At a minimum,
that will usually consist of SSAS, SSIS, and SSRS, so those are the components we’ll cover here.
For the purpose of this discussion, we’ll assume you already have a tested backup process for
any and all source data that is intended to be fed into your BI solution. It is, of course, impor-
tant to properly secure both your backup and restore processes and data as well.

Backing Up SSAS
The simplest way to perform a backup of a deployed SSAS solution is to use SSMS. Right-click
on the database name in the Object Explorer tree and then click Backup. This generates the
XMLA script to back up the metadata (in XMLA) from the SSAS structures. This metadata will
include all types of SSAS objects in your solution—that is, data sources, data source views,
cubes, dimensions, data mining structures, and so on. Figure 4-17 shows the user interface
in SSAS for configuring backup. Note that in this dialog box you have the option to generate
the XMLA script to perform this backup. XMLA scripting is an efficiency that many BI admin-
istrators use to perform routine maintenance tasks such as regular backups.

Figure 4-17 The SSMS interface for SSAS backups

When you perform backups (and restores), you must have appropriate permissions. These
include membership in a server role for SSAS, full control on each database to be backed up,
and write permission on the backup location.

Note You can also use the Synchronize command to back up and restore. This command
requires that you are working with instances located in two different locations. One scenario is
when you want to move development data to production, for example. A difference between
the backup and synchronize commands is that the latter provides additional functionality. For
instance, you can have different security settings on the source (backup) and destination (syn-
chronization target). This type of functionality is not possible with regular SSAS backup. Also,
synchronization does an incremental update of objects that differ between the source server and
the destination server.

Backing Up SSIS
The backup strategy you use for development packages depends on where you choose to
store those packages. SSIS packages can be stored in three different locations: on SQL Server,
on the file system, or in the SSIS package store.

If you store the packages on SQL Server, backing up the msdb database will include a backup
of your SSIS packages. If you store the packages in a particular folder on the file system,
backing up that folder will back up your packages. If you store the packages in the SSIS
package store, backing up the folders associated with that store will back up the packages
contained within it. Although you should back up data sources, data source views, and SSIS
packages, we usually don’t create global data sources and data source views when working
with SSIS, so we only have to consider backing up the SSIS packages themselves.
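For example, if you store your packages in msdb, a routine full backup of that database also captures them. The following is a minimal sketch; the backup path is a placeholder.

-- Packages saved to SQL Server are stored in msdb, so a full backup of msdb
-- also backs up those SSIS packages.
BACKUP DATABASE [msdb]
    TO DISK = 'E:\Backups\msdb.bak'   -- placeholder path
    WITH INIT, NAME = 'msdb full backup (includes SSIS packages)';
GO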

Backing Up SSRS
There are several parts to consider when setting your baseline SSRS backup strategy:

■■ Back up the RS databases Back up the reportserver and reportservertempdb data-
bases that run on a SQL Server Database Engine instance. You cannot back up the
reportserver and reportservertempdb databases at the same time, so schedule your
backup jobs on SQL Server to run at different times for these two databases (see the
sketch after this list).
■■ Back up the RS encryption keys You can use the Reporting Services Configuration
Manager to perform this task.
■■ Back up the RS configuration files As mentioned, SSRS configuration settings are
stored in XML configuration files. These should be backed up. They are Rsreport-
server.config, Rssvrpolicy.config, Rsmgrpolicy.config, Reportingservicesservice.exe.con-
fig, Web.config for both the Report Server and Report Manager ASP.NET applications,
and Machine.config for ASP.NET. For default locations, see the SQL Server Books Online
topic “Configuration Files (Reporting Services)” at http://msdn.microsoft.com/en-us/
library/ms155866.aspx.
■■ Back up the data files You must also back up files created by Report Designer and
Model Designer. These include report definition (.rdl) files, report model (.smdl) files,
shared data source (.rds) files, data source view (.dsv) files, data source (.ds) files, report
server project (.rptproj) files, and report solution (.sln) files.
■■ Back up administrative script files and others You should back up any script files
(.rss) that you created for administration or deployment tasks. You should also back
up any custom extensions and custom assemblies you are using with a particular SSRS
instance.
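To make the first item in this list concrete, here is a minimal sketch of backing up the two catalog databases at different times. The database names shown are the defaults and the backup paths are placeholders; in practice, you would schedule these as separate SQL Server Agent jobs.

-- Back up the SSRS catalog databases; run these at different times.
BACKUP DATABASE [reportserver]
    TO DISK = 'E:\Backups\reportserver.bak'   -- placeholder path
    WITH INIT, NAME = 'SSRS catalog database backup';
GO
BACKUP DATABASE [reportservertempdb]
    TO DISK = 'E:\Backups\reportservertempdb.bak'   -- placeholder path
    WITH INIT, NAME = 'SSRS temporary catalog database backup';
GO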

Auditing and Compliance


When you begin your BI development efforts, you can choose to work with a copy of pro-
duction data. If you do that, you might have to maintain certain pre-established standards
for data access. Usually, auditing records are used to prove whether these standards have
been met. Some examples are HIPAA compliance (for health care records) and SOX (or
Sarbanes-Oxley, for businesses of a certain size and type). Of course, not all industries have
such rigorous requirements. However, as problems such as identity theft grow in scale, pro-
tecting data in all situations becomes increasingly important. We’ve encountered several
unfortunate situations where key company data was compromised—that is, read, copied,
altered, or stolen—during our many years as professional database consultants. In more
than one situation, failure to properly protect company data has led to disciplinary action or
even termination for staff who failed to be diligent about data security (which should include
access auditing). Because BI projects usually include very large sets of data, we consider proper
attention to auditing and compliance to be a key planning consideration from the
earliest stages of every project.

A great tool to use for access auditing is SQL Server Profiler. This tool has two main uses:
compliance (via access auditing) and performance monitoring. In this section, we’re talking
only about the first use. Later in your project cycle, you’ll use SQL Server Profiler for perfor-
mance tuning. We find that SQL Server Profiler is underused or even not used at all in BI proj-
ects, and this is unfortunate when you consider the business value the tool can bring to your
project. Figure 4-18 shows SQL Server Profiler in action, with a live trace capturing an MDX
query.

Figure 4-18 SQL Server Profiler traces capture SSAS activity such as MDX queries.

To use SQL Server Profiler to capture activity on SSAS cubes and mining models, you need to
perform these steps:

1. Connect to SSAS using an account that is a member of the Analysis Services
Administrators server role. You do not have to be a local machine administrator to run
SQL Server Profiler.
2. Select File, New Trace to create a new trace. Select the default trace definition in the
New Trace dialog box. You might want to add more counters in the trace definition,
such as those which capture events of interest. For security compliance, this at least
includes logon activity. You can also choose to monitor the success or failure of permis-
sions in accessing particular statements or objects.
3. After selecting the objects to include in your trace definition in the New Trace dialog
box, you can choose to see data structures of interest by clicking the Filter button in
the New Trace Definition dialog box and then typing the name of those structures
in the text box below the LIKE filter. Keep in mind that you can capture activity on
source SQL Server OLTP systems, as well as on SSAS cubes and mining models using
SQL Server Profiler trace definitions.
4. You can continue to set filters on the trace as needed. We like to filter out extrane-
ous information by entering values in the text box associated with the NOT LIKE trace
definition filter. Our goal is always to capture only the absolute minimum amount of
information. SQL Server Profiler traces are inclusive by nature and can quickly become
huge if not set up in a restrictive way. Huge traces are unproductive for two reasons.
You probably don’t have the time to read through huge traces looking for problems.
Also, capturing large amounts of information places a heavy load on the servers where
tracing is occurring. We’ve seen, even in development environments, that this load can
slow down performance significantly. Figure 4-19 shows the events you can select to
monitor related to security for SSAS.

Figure 4-19 SQL Server Profiler traces can capture security audit events.

Note A side benefit of using SQL Server Profiler during the early phases of your BI project (in
addition to helping you maintain compliance with security standards) is that your team will be-
come adept at using the tool itself. We’ve seen many customers underuse SQL Server Profiler (or
not use it at all) during the later performance-tuning phases of a project. The more you and your
developers work with SQL Server Profiler, the more business value you’ll derive from it.

Another aspect of SQL Server Profiler traces more closely related to performance monitoring
is that trace activity can be played back on another server. This helps with debugging.

Auditing Features in SQL Server 2008


SQL Server 2008 has several new features related to auditing and compliance. If you’re using
SQL Server 2008 as a one of the source databases in your BI solution, you can choose to
implement additional security auditing on this source data of your solution in addition to any
auditing that might already be implemented on the data as it is being consumed by SSIS,
SSAS, or SSRS. This additional layer of auditing might be quite desirable, depending on the
security requirements you’re working with.

One example of the auditing capabilities included in SQL Server 2008 OLTP data is the new
server-level audit policies. These can be configured by using SSMS or by using newly added
Transact-SQL commands for SQL Server 2008. Figure 4-20 shows a key dialog box for doing
this in SSMS.

Figure 4-20 The new server audit dialog box in SQL Server 2008
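To give you a sense of the Transact-SQL route, the following is a minimal sketch that creates a server audit and an audit specification capturing logon activity. The audit names and file path are placeholders; add or remove action groups to match your own compliance requirements.

USE master;
GO
-- Define where the audit records are written (placeholder path).
CREATE SERVER AUDIT Audit_BI_Compliance
    TO FILE (FILEPATH = 'E:\Audits\');
GO
-- Capture successful and failed logons at the server level.
CREATE SERVER AUDIT SPECIFICATION Audit_BI_Logons
    FOR SERVER AUDIT Audit_BI_Compliance
    ADD (SUCCESSFUL_LOGIN_GROUP),
    ADD (FAILED_LOGIN_GROUP)
    WITH (STATE = ON);
GO
-- Enable the audit itself.
ALTER SERVER AUDIT Audit_BI_Compliance WITH (STATE = ON);
GO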

Source Control
A final consideration for your development environment is how you will handle source con-
trol for all the code your team generates. Keep in mind that the types of code you must
maintain vary depending on which component of the SQL Server 2008 BI stack you’re work-
ing with. As with any professional development project, if your developer team consists of
multiple people, you should select and use a production-quality source control system. We
use Visual Studio Team System Team Foundation Server most often for this purpose. There
are, of course, many source code solutions available on the market. The obvious approach
is to pick one and to get everyone on the team to use it. The type of code your team will be
checking in to the selected source system will vary depending on which component or com-
ponents of the BI solutions they are working on.

For SSAS, you’ll be concerned with XMLA for OLAP cube and data mining model metadata,
as well as MDX for OLAP cubes (queries or expressions) or DMX for data mining structures
(queries or expressions).

For SSIS, you’ll be managing SSIS packages. These packages also consist of an XML dialect.
Packages can also contain code, such as VBScript, C#, and so on. You might also be con-
cerned with backing up dependent external files that contain package configuration infor-
mation. These files are also in an XML format. Another source control consideration for SSIS
is storage location. Your choices (built into the BIDS interface) are to store packages in SQL
Server (which stores them in the syssispackages tables in the msdb database), on the file sys-
tem (in any folder you specify), or in the SSIS package store (which is a set of folders named
File System and msdb). If you choose to store packages in msdb, you can use SQL Server
Backup to back up that database, which will include the files associated with SSIS packages
stored there. These options are shown in Figure 4-21.

Figure 4-21 You have three options for where to store SSIS packages.
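If you choose the msdb option, a quick query lists the packages currently stored there. This is just a sketch; the system table name shown is the one SQL Server 2008 uses, so verify it against your own instance.

-- List SSIS packages that have been saved to SQL Server (msdb).
SELECT name, description, createdate
FROM msdb.dbo.sysssispackages
ORDER BY name;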

Finally, when you are planning your SSIS backup, you should choose to back up folders that
are referenced in the SSIS configuration file, MsDtsSrvr.ini.xml. This file lists the folders on the
server that SSIS monitors. You should make sure these folders are backed up.

For SSRS, you’ll have RDL code for report definitions and report models. These are .rdl
and .smdl files. Also, SSRS stores metadata in SQL databases called reportserver and
reportservertempdb. You’ll also want to back up the configuration files and the encryption
key. See the topic “Back Up and Restore Operations for a Reporting Services Installation” in
SQL Server Books Online.

When you’re conducting SSAS development in BIDS, your team has two possible modes of
working with SSAS objects—either online (which means live or connected to deployed cubes
and mining models) or project mode (which means disconnected or offline). In online mode
all changes made in BIDS are implemented in the solution when you choose to process
those changes. If you have created invalid XMLA using the BIDS designer, you could break
your live OLAP cubes or data mining structures by processing invalid XMLA. BIDS will gener-
ate a detailed error listing showing information about any breaking errors. These will also
be indicated in the designer by red squiggly lines. Blue squiggly lines indicate AMO design
warnings. These warnings will not break an SSAS object (although they do indicate nonstandard
design, which could result in suboptimal performance).

In project mode, XMLA is generated by your actions in the BIDS designer. You must deploy
that XMLA to a live server to apply the changes. If you deploy breaking changes (that is, code
with errors), the deploy step might fail. In that case, the process dialog box in BIDS will dis-
play information about the failed deployment step (or steps). It is important for you to under-
stand that when you deploy changes that you’ve created offline, the last change overwrites
all intermediate changes. So if you’re working with a team of SSAS developers, you should
definitely use some sort of formal third-party source control system to manage offline work
(to avoid one developer inadvertently overwriting another’s work upon check-in).

Figure 4-22 shows the online mode indicator in the BIDS environment title bar. Figure 4-23
shows the project (offline) mode indicator. Note that online mode includes both the SSAS
project and server name (to which you are connected), and that project mode shows only the
solution name.

Figure 4-22 BIDS online mode includes the SSAS server name.

Figure 4-23 BIDS project mode shows only the SSAS solution name.

Summary
In this chapter, we got physical. We discussed best practices for implementing an effective
development environment for BI projects. Our discussion included best practices for survey-
ing, preparing, planning, and implementing both physical servers and logical services. We
included a review of security requirements and security implementation practices. We also
discussed preparatory steps, such as conducting detailed surveys of your existing environ-
ment, for setting up your new production environment.

In the next chapter, we turn our attention to the modeling processes associated with BI
projects. We look in detail at architectural considerations related to logical modeling. These
include implementing OLAP concepts such as cubes, measures, and dimensions by using SQL
Server Analysis Services as well as other tools. We’ll also talk about modeling for data mining
using Analysis Services. Appropriate logical modeling is one of the cornerstones to imple-
menting a successful BI project, so we’ve devoted an entire chapter to it.
Chapter 5
Logical OLAP Design Concepts
for Architects
Correctly modeled and implemented OLAP cubes are the core of any Microsoft SQL Server
2008 business intelligence (BI) project. The most successful projects have a solid founda-
tion of appropriate logical modeling prior to the start of developing cubes using SQL Server
Analysis Services (SSAS). Effective OLAP modeling can be quite tricky to master, particularly if
you’ve spent years working with relational databases. The reason for this is that proper OLAP
modeling is often the exact opposite of what you had been using in OLTP modeling. We’ve
seen over and over that unlearning old habits can be quite challenging. One example of how
OLAP modeling is the opposite of OLTP modeling is the deliberate denormalization (duplica-
tion of data) in OLAP models, a practice that stands in contrast to the typical normalization in
OLTP models. We’ll explore more examples throughout this chapter.

In this chapter, we’ll explore classic OLAP cube modeling in depth. This type of modeling is
based on a formalized source structure called a star schema. As previously mentioned, this
type of modeling is also called dimensional modeling in OLAP literature. We’ll take a close
look at all the parts of this type of database schema, and then we’ll talk about different
approaches to get there, including model first or data first. We’ll also discuss the real-world
cost to your project of deviating from this standard design.

Because data mining is another key aspect of BI solutions, we’ll devote two chapters (Chapter
12, “Understanding Data Mining Structures,” and Chapter 13, “Implementing Data Mining
Structures”) to it. There we’ll include information about best practices regarding the use of
logical modeling as a basis for building data mining structures. Modeling for data mining is
quite different than the modeling techniques you use in OLAP cube modeling. We believe
that most BI projects include OLAP and data mining implementations and the appropriate
time to model for both is at the beginning of the BI project. We’ll start, however, with OLAP
modeling because that is where most readers of this book will start the modeling process for
their BI project.

Designing Basic OLAP Cubes


Before we begin our discussion of OLAP modeling specifically, let’s take a minute to discuss
an even more fundamental idea for your BI project. Is a formalized method (that is, star
schema) strictly required as a source schema for an OLAP cube? The technical answer to
this question is “No.” Microsoft purposely does not require you to base OLAP cubes only on
source data—that is, in a star schema format. In other words, you can create cubes based on
OLTP (or relational and normalized data) or almost any data source in any format that you
can connect to via the included SSAS data source providers. SSAS 2008 is designed with a
tremendous amount of flexibility with regard to source data. This flexibility is both a blessing
and a curse.

If logical OLAP modeling is done well, implementing cubes using SSAS can be very straight-
forward. If such modeling is done poorly, cube implementation can be challenging at best
and counterproductive at worst. We’ve seen many BI projects go astray at this point in devel-
opment. It’s physically possible to use SSAS to build an OLAP cube from nearly any type of
relational data source. But this is not necessarily a good thing!

Particularly if you are new to BI (using OLAP cubes), you’ll probably want to build your first
few projects by taking advantage of the included wizards and tools in Business Intelligence
Development Studio (BIDS). These timesavers are designed to work using traditional star
schema source data. If you intend to go this route, and we expect that most of our readers
will, it’s important to understand dimensional modeling and to attempt to provide SSAS with
data that is as close to this format as possible.

You might be wondering why Microsoft designed BIDS wizards and tools this way but doesn’t
require star schemas as source data. The answer is that Microsoft wants to provide OLAP
developers with a large amount of flexibility. By flexibility, we mean the ability for develop-
ers to create cubes manually. What this means in practice is that for the first few attempts,
you’ll fare much better if you “stick to the star.” After you become more experienced at cube
building, you’ll probably find yourself enjoying the included flexibility to go “beyond the star”
and build some parts of your BI solution manually. Another way to understand this important
concept is to think of a star schema as a starting point for all cube projects, after which you
have lots of flexibility to go outside of the normal rigid star requirements to suit your particu-
lar business needs. What this does not mean, however, is that you can or should completely
ignore dimensional modeling—and, yes, we’ve seen quite a few clients make this mistake!
Assuming we’ve captured your interest, let’s take a closer look at dimensional modeling for
OLAP cubes.

Star Schemas
Key and core to OLAP logical modeling is the idea of using at least one (and usually more)
star schema source data models. These models can be physical structures (relational tables) or
logical structures (views). What we mean is that the model itself can be materialized (stored
on disk), or it can be created via a query (normally called a view) against the source data.
We’ve seen clients have success with both approaches. Sometimes we see a combination of
these approaches as well—that is, some physical storage and some logical definition.
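To make the logical (view-based) option concrete, the following is a minimal sketch of a dimension exposed as a view over OLTP tables rather than materialized on disk. The table and column names follow the AdventureWorks OLTP sample purely for illustration; substitute your own source objects.

-- The dimension is defined as a query over the OLTP source
-- rather than being copied to disk.
CREATE VIEW dbo.vDimProduct
AS
SELECT
    p.ProductID AS ProductAlternateKey,  -- original (business) key
    p.Name      AS ProductName,
    p.Color,
    p.ListPrice
FROM Production.Product AS p;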

Despite the instances of success using other implementations, the most common imple-
mentation is to store the star schema source data on disk. Because on-disk storage entails
making a copy of all the source data you intend to use to populate your OLAP cube (which
can, of course, be a huge amount of data), the decision of whether to use physical or logical
source data is not a trivial one. For this reason, we’ll address the advantages and disadvan-
tages of both methods in more detail later in this chapter. At this point, we’ll just say that
the most common reasons to incur physical disk storage overhead are to improve cube load
performance (because simple disk reads are usually faster than aggregated view queries) and
reduce the query load on OLTP source systems.

So then, what is this mysterious star schema source structure? It consists of two types of rela-
tional sources (usually tables), called fact and dimension tables. These tables can be stored in
any location to which SSAS can connect. Figure 5-1 shows a list of the included providers in
SSAS.

Figure 5-1 SSAS providers

For the purposes of our discussion, we’ll reference source data stored in a SQL Server 2008
RDBMS instance in one single database. It is in no way a requirement to use only SQL Server
to store source data for a cube. Source data can be stored in any system for which SSAS
includes a provider. Production situations usually are much more complex in terms of source
data than our simplified example. At this point in our discussion, we want to focus on explain-
ing the concept of dimensional modeling, and simplicity is best.

A star schema consists of at least one fact table and at least one dimension table. Usually,
there are many dimension tables, which are related to one or more fact tables. These two
types of tables each have a particular schema. For a pure star design, the rows in the two
types of tables are related via a direct primary key-foreign key relationship. (Other types of
relationships are possible, and we’ll cover those later in this chapter.) Specifically, primary keys
uniquely identify each row in a dimension table and are related to foreign keys that reside
in the fact table. The term star schema originated from a visualization of this relationship, as
illustrated in Figure 5-2. One aspect of the visualization does not hold true—that is, there is
no standard-size star. Rather than the five points that we think of as part of a typical drawing
of a star, an OLAP star schema can have 5, 50, or 500 dimension tables or points of the star.
The key is to think about the relationships between the rows in the two types of tables. We’ve
taken a view (called a data source view) from SSAS in BIDS and marked it up so that you can
visualize this structure for yourself.

Figure 5-2 Visualization of a star schema

Next we’ll take a closer look at the particular structures of these two types of tables—fact
and dimension.

Fact Tables
A star schema fact table consists of at least two types of columns: key columns and fact (or
measure) columns. As mentioned, the key columns are foreign-key values that relate rows
in the fact table to rows in one or more dimension tables. The fact columns are most often
numeric values and are usually additive. These fact columns express key business metrics. An
example of this is sales amount or sales quantity.

Note Facts or measures, what’s the difference? Technically, facts are individual values stored in
rows in the fact table and measures are those values as stored and displayed in an OLAP cube. It’s
common in the OLAP literature to see the terms facts and measures used interchangeably.

An example fact table, called FactResellerSales (from the AdventureWorksDW2008 sample), is
shown in Figure 5-3. As mentioned previously, some of the example table names might vary
slightly as we used community technology preview (CTP) samples, rather than the release to
manufacturing (RTM) version, in the early writing of this book.

It’s standard design practice to use the word fact in fact table names, either at the beginning
of the name as shown here (FactResellerSales) or at the end (ResellerSalesFact). In Figure 5-3,
note the set of columns all named xxxKey. These columns are the foreign key values. They
provide the relationship from rows in the fact table to rows in one or more dimension tables
and are said to “give meaning” to the facts. These columns are usually of data type int for
SQL Server, and more generally, integers for most RDBMSs. In Figure 5-3, the fact columns
that will be translated into measures in the cube start with the SalesOrderNumber column
and include typical fact columns, such as OrderQuantity, UnitPrice, and SalesAmount.

(The callouts in Figure 5-3 group the table’s columns into foreign key columns, fact source columns, and other columns.)

Figure 5-3 Typical star schema fact table.
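The following is a simplified sketch of how such a fact table might be declared. The column names and data types are illustrative rather than the actual AdventureWorksDW2008 definition, and several columns are omitted for brevity.

CREATE TABLE dbo.FactResellerSalesExample (
    OrderDateKey      int      NOT NULL,  -- foreign key to a date dimension
    ResellerKey       int      NOT NULL,  -- foreign key to DimReseller
    ProductKey        int      NOT NULL,  -- foreign key to DimProduct
    CurrencyKey       int      NOT NULL,  -- foreign key to DimCurrency
    SalesTerritoryKey int      NOT NULL,  -- foreign key to DimSalesTerritory
    OrderQuantity     smallint NOT NULL,  -- fact (measure) column
    UnitPrice         money    NOT NULL,  -- fact (measure) column
    SalesAmount       money    NOT NULL   -- fact (measure) column
);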

Fact tables can also contain columns that are neither keys nor facts. These columns are
the basis for a special type of dimension called a degenerate dimension. For example, the
RevisionNumber column provides information about each row (or fact), but it’s neither a key
nor a fact.

Exploring the actual data in this sample fact table can yield further insight into fact table
modeling. The SSAS interface in BIDS includes the ability to do this. You can simply right-click
on any source table in the Data Source View section and then select the menu option Explore
Data. Partial results are shown in Figure 5-4. Note the duplication of data in the results in the
CurrencyKey, SalesTerritoryKey, and SalesOrderNumber columns—this is deliberate. It’s clear
when reviewing this table that you only have part of the story with a fact table. This part can
be summed up by saying this: Fact tables contain foreign keys and facts. So what gives mean-
ing to these facts? That’s the purpose of a dimension table.

Figure 5-4 Fact table data consists of denormalized (deliberately duplicated) data.

BIDS contains several ways to explore the data in a source table in a data source view. In
addition to the table view shown in Figure 5-4, you can also choose a pivot table, chart, or
pivot chart view of the data. In the case of fact table data, choosing a different view really
only illustrates that fact tables are only a part of a star schema.

Figure 5-5 shows a pivot table view of the same fact table we’ve been looking at in this sec-
tion. The view is really not meaningful because facts (or measures) are associated only with
keys, rather than with data. We show you this view so that you can visualize how the data
from the fact and dimension tables works together to present a complete picture.

Figure 5-5 Fact table information in a pivot table view in BIDS

Another important consideration when modeling your fact tables is to keep the tables as
narrow as possible. Another way to say this is that you should have a business justification for
each column you include. The reason for this is that a star schema fact table generally con-
tains a much larger number of rows than any one dimension table. So fact tables represent
your most significant storage space concern in an OLAP solution.

We’ll add a bit more information about modeling fact tables after we take an initial look at
the other key component to a star schema—dimension tables.

Dimension Tables
As we mentioned previously, the rows in the dimension table provide meaning to the rows
in the fact table. Each dimension table describes a particular business entity or aspect of the
facts (rows) in the fact table. Dimension tables are based typically on factors such as time,
customers, and products. Dimension tables consist of three types of columns:

■■ A newly generated primary key (sometimes called a surrogate key) for each row in the
dimension table

■■ The original, or alternate, primary key


■■ Any additional columns that further describe the business entity, such as a Customers
table with columns such as FirstName, LastName, and so on

We’ll take a closer look at the sample table representing this information from
AdventureWorksDW2008, which is called DimCustomer and is shown in Figure 5-6. As with
fact tables, the naming convention for dimension tables usually includes the dim prefix as
part of the table name.

Figure 5-6 Customer dimension tables contain a large number of columns.

Dimension tables are not required to contain two types of keys. You can create dimension
tables using only the original primary key. We do not recommend this practice, however. One
reason to generate a new unique dimension key is that it’s common to load data into dimen-
sions from disparate data sources (for example, from a SQL Server table and a Microsoft
Office Excel spreadsheet). Without generating new keys, you have no guarantee of having a
unique identifier for each row.

Even if you are retrieving source data from a single source database initially, it’s an important
best practice to add this new surrogate key when you load the dimension table. The reason
is that business conditions can quickly change—you might find yourself having to modify a
production cube to add data from another source for many reasons (a business merger, an
acquisition, a need for competitor data, and so on). We recommend that you always use sur-
rogate keys when building dimension tables.
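A minimal sketch of such a dimension table, showing both a generated surrogate key and the original (alternate) key, might look like the following. The column names are illustrative; the real DimCustomer table carries many more attributes.

CREATE TABLE dbo.DimCustomerExample (
    CustomerKey          int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- new surrogate key
    CustomerAlternateKey nvarchar(15)      NOT NULL,              -- original (business) key
    FirstName            nvarchar(50)      NULL,
    LastName             nvarchar(50)      NULL,
    EmailAddress         nvarchar(100)     NULL
    -- ...additional descriptive attributes as the business requires
);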

Note that this table is particularly wide, meaning it contains a large number of columns. This
is typical for an OLAP design. In your original source data, the customer data was probably
stored in many separate tables, yet here it’s all lumped into one big table. You might be won-
dering why. We’ll get to that answer in just a minute. Before we do, however, let’s take a look
at the data in the sample dimension table using the standard table viewer and the pivot table
viewer for SSAS data source views in BIDS. Figure 5-7 shows the result of exploring the table
view. To see the exact view that’s shown in Figure 5-7, after the Explore Data window opens,
scroll a bit to the right.

Figure 5-7 Dimension table data is heavily denormalized in a dimensional design.

The pivot table view of the dimension data is a bit more interesting than the fact table infor-
mation. Here we show the pivot chart view. This allows you to perform a quick visual valida-
tion of the data—that is, determine whether it looks correct. Figure 5-8 shows this view. The
ability to sample and validate data using the table, pivot table, chart, and pivot chart views in
the data source view in BIDS is quite a handy feature, particularly during the early phases of
your OLAP design. Figure 5-8 shows how you can select the columns of interest to be shown
in any of the pivot views.

Figure 5-8 The pivot viewer included in BIDS for the SSAS data source view

A rule of thumb for dimension table design is that, when in doubt, you should include the
attribute—meaning that adding columns to dimension tables is generally good design. Remember
that this is not the case with fact tables. The reason is that most dimension tables contain a
relatively small number of rows compared to a fact table, so adding information does not
significantly slow performance. There are, of course, limits to this. However, most of our
customers have been quite pleased by the large number of attributes they can associate with key
business entities in dimension tables.

Being able to include lots of attributes is, in fact, a core reason to use OLAP cubes overall—
by this, we mean the ability to combine a large amount of information into a single, simple,
and fast queryable structure. Although most limits to the quantity of dimensional attributes
you can associate with a particular business entity have been removed in SSAS 2008, you should
still base the inclusion of columns in your dimension tables on business needs. In our
real-world experience, most dimensions end up with between 10 and 50 attributes (or columns).

Dimensional information is organized a bit differently than you see it here after it has been
loaded into an OLAP cube. Rather than simply using a flat view of all information from all
rows in the source dimension table, OLAP cubes are generally built with dimensional hierar-
chies, or nested groupings of information. We’ll look more closely at how this is done later in
this chapter.

Denormalization
Earlier, we mentioned the difficulty that we commonly see regarding understanding and
implementing effective OLAP modeling, particularly by those with a lot of traditional data-
base (OLTP) experience. The concept of denormalization is key to understanding this. Simply
put, denormalization is the deliberate duplication of data to reduce the number of entities
(usually tables) that are needed to hold information.

Relational data is generally modeled in a highly normalized fashion. In other words, data
is deliberately not duplicated. This is primarily done to facilitate quick data manipulation,
via INSERT, UPDATE, and DELETE operations. The fewer times a data item appears in a set
of tables, the faster these operations run. Why then is an OLAP source denormalized? The
answer lies in the structure of cubes themselves.

Although it’s difficult to visualize, an OLAP cube is an n-dimensional structure that is com-
pletely denormalized. That is, it’s one big storage unit, sort of similar conceptually to a huge
Excel PivotTtable view. Another way to think about this is that an OLAP cube has significantly
fewer joins than any RDBMS system. This is one of the reasons that OLAP cube query results
are returned so much more quickly than queries to RDBMS systems that include many joins.
This is the exact opposite of the classic RDBMS database, which consists of lots of disparate
tables. And this is the crux of OLAP modeling—because, of course, you’ll be extracting data
from normalized OLTP source systems. So just exactly how do you transform the data and
then load the transformed data into OLAP cubes?
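To illustrate the contrast, consider the following sketch, which uses hypothetical table names rather than any particular sample database. In the normalized source, reaching a customer's region requires several joins; in the denormalized dimension table, the same attributes sit together in one wide row.

-- Normalized (OLTP) source: several joins to reach a customer's region.
SELECT c.FirstName, c.LastName, ci.CityName, r.RegionName
FROM dbo.Customer        AS c
JOIN dbo.CustomerAddress AS ca ON ca.CustomerID = c.CustomerID
JOIN dbo.City            AS ci ON ci.CityID     = ca.CityID
JOIN dbo.Region          AS r  ON r.RegionID    = ci.RegionID;

-- Denormalized dimension table: no joins are needed when the cube reads it.
SELECT FirstName, LastName, CityName, RegionName
FROM dbo.DimCustomer;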

Note If you’re interested in further discussions about normalization and denormalization, go


to the following Web site for an interesting discussion of these two different database modeling
styles: http://www.devx.com/ibm/Article/20859.

Back to the Star


There are additional options to OLAP modeling (that is, using table types other than fact
tables and star dimension tables) that we’ll discuss later in this chapter, but the basic con-
cept for OLAP modeling is simply a fact table and some related dimension tables. Figure 5-9
shows a conceptual star schema. Note that we’ve modeled the dimension keys in the pre-
ferred way in this diagram—that is, using original and new (or surrogate) keys.

Figure 5-9 Conceptual view of a star schema model
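As a rough Transact-SQL sketch of what a schema like the one in Figure 5-9 might look like (all names are illustrative, and only a few columns are shown), the fact table's foreign keys are the surrogate keys of the surrounding dimension tables:

CREATE TABLE dbo.DimDate
(
    DateKey      int      NOT NULL PRIMARY KEY,  -- for example, 20090131
    FullDate     date     NOT NULL,
    CalendarYear smallint NOT NULL
);

CREATE TABLE dbo.DimProduct
(
    ProductKey          int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    ProductAlternateKey nvarchar(25) NOT NULL,                   -- original key
    ProductName         nvarchar(50) NOT NULL
);

CREATE TABLE dbo.FactSales
(
    DateKey       int   NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey    int   NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    SalesAmount   money NOT NULL,
    SalesQuantity int   NOT NULL
);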

So why create star schemas? As discussed earlier, the simplest answer is that star schemas
work best in the BIDS development environment for SSAS. Although it's possible to create a
cube from OLTP (or normalized) source data, the results will not be optimal without a great
deal of manual work on your part. In general, we do not recommend this practice. Also, it's
common to discover flawed data in source systems during the life cycle of a BI project. We
find that usually at least some source data needs to be part of a data-cleansing and valida-
tion process. This validation is best performed during the extract, transform, and load (ETL)
phase of your project, which occurs prior to loading OLAP cubes with data.

With the 2008 release of SSAS, Microsoft has improved the flexibility of source structures for
cubes. What this means is that you start with a series of star schemas and then make adjust-
ments to your model to allow for business situations that fall outside of a strict star schema
model. One example of this improved flexibility is the ability to base a single cube on mul-
tiple fact tables. Figure 5-10 shows an example of using two fact tables in an OLAP star
schema. This type of modeling is usually used because some, but not all, dimensions relate to
some, but not all, facts. Another example of this is the need to allow null values to be loaded
into a dimension or fact table and to define a translation of those nulls into a non-null value
that is understandable by end users—for example, “unknown.”

In many OLAP products (including versions of SSAS older than 2005), you were limited to
creating one cube per fact table. So, if you wanted a unified view of your data, you'd be
forced to manually “union” those cubes together. This caused unnecessary complexity and
additional administrative overhead. In SSAS 2008, the SSAS Dimension Usage tab in the cube
designer in BIDS allows you to define the grain of each relationship between the rows in the
various fact tables to the rows in the dimension tables. This improved flexibility now results
in most BI solutions being based on a single, large (or even huge) cube (or view of enterprise
data). This type of modeling reflects the business need to have a single, unified version of
relevant business data. This cube presents the data in whatever level of detail that is mean-
ingful for the particular end user—that is, it can be summarized, detailed, or any combination

of both. This ability to create one view of the (business) “truth” is one of the most compelling
features of SSAS 2008.

Figure 5-10 Conceptual view of a star schema model using two fact tables

To drill into the Dimension Usage tab, we'll look at Figure 5-11. You'll find this tab in BIDS after
you double-click the sample OLAP cube; it's the second tab in the cube designer. Here the employee dimension
has no relationship with the rows (or facts) in the Internet Sales table (because no employees
are involved in Internet sales), but it does have a relationship with the Reseller Sales table
(because employees are involved in reseller sales). Also, the customer dimension has no rela-
tionship with the rows in the Reseller Sales table because customers are not resellers, but
the customer dimension does have a relationship with the data in the Internet Sales group
(because customers do make Internet purchases). Dimensions common to both fact tables
are products and time (due date, order date, and ship date). Note that the time dimension has
three aliases. A dimension aliased multiple times in this way is called a role-playing dimension.
We’ll discuss this type of dimension in more detail in Chapter 8, “Refining OLAP Cubes and
Dimensions.”

Figure 5-11 The Dimension Usage tab in BIDS shows the relationships between the fact and dimension
tables.

When creating a cube, the SSAS Cube Wizard detects the relationships between dimension
and fact table rows and populates the Dimension Usage tab with its best guess, based on examin-
ing the source column names. You should review these relationships and update them as needed if the results do not
exactly match your particular business scenarios. When we review building your first cube in
Chapter 7, “Designing OLAP Cubes Using BIDS,” we’ll show more specific examples of updates
you might need to make.

Using Grain Statements


By this point, you should understand the importance and basic structure of a star schema.
Given this knowledge, you might be asking yourself, “So, if the star schema is so important,
what’s the best way for me to quickly and accurately create this model?” In our experience, if
you begin with your end business goals in mind, you’ll arrive at the best result in the quickest
fashion, which will be made of a series of tightly defined grain statements. A grain statement
is a verbal expression of the facts (or measures) and dimensions, expressed at an appropriate
level of granularity. Granularity means level of detail. An example is a time dimension. Time
granularity can be a month, day, or hour. Specifically, well-written grain statements capture
the most detailed level of granularity that is needed for the business requirements. Here are
some examples of simple grain statements:

■■ Show the sales amount and sales quantity (facts or measures) by day, product,
employee, and store location (with included dimensions being time, products, employ-
ees, and geography). The grain of each dimension is also expressed—that is, time
dimension at the grain of each day, products dimension at the grain of each product,
and so on.
■■ Show the average score and quantity of courses taken by course, day, student, man-
ager, curriculum, and curriculum type.
■■ Show the count of offenders by location, offense type, month, and arresting officer.

Well-written grain statements are extremely useful in OLAP modeling. It’s critical that each
of the grain statements be validated by both subject matter experts and business decision
makers prior to beginning the modeling phase. We use a sign-off procedure to ensure that
appropriate validation of grain statements has taken place before we begin any type of
prototype (or production) development work. We consider the sign-off on grain statements
as a key driver in moving from the envisioning to the design and development stages of
work. Here is a sampling of validation questions that we ask to determine (or validate) grain
statements:

■■ What are the key metrics for your business? Some examples for a company that sells
products are as follows: sales amount, sales quantity, gross profit, net profit, expenses,
and so on.
■■ What factors do you use to evaluate those key metrics? For example, do you evaluate
sales amount by customer, employee, store, date, or “what”?
■■ By what level of granularity do you evaluate each factor? For example, do you evaluate
sales amount by day or by hour? Do you evaluate customer by store or by region?

As you can see by the example grain statements listed, BI solutions can be used by a broad
range of organizations. In our experience, the "Show me the sales amount by day" model,
although it’s the most typical, is not the only situation in which BI can prove useful. Some
other interesting projects we’ve worked on included using OLAP cubes to improve decision
support for the following business scenarios:

■■ Improve detection of foster care families not meeting all state requirements. (SSAS data
mining was also used in this scenario.)
■■ Provide a flexible, fast query system to look up university course credits that are trans-
ferable to other universities.
■■ Improve food and labor costs for a restaurant by viewing and acting on both trends
and exception conditions.
■■ Track the use and effectiveness of a set of online training programs by improving the
timeliness and flexibility of available reports.

As we’ve mentioned, when considering why and where you might implement SSAS OLAP
cubes in your enterprise it’s important to think broadly across the organization—that is,
which groups would benefit from an aggregated view of their (and possibly other groups’)
data? BI for everyone via OLAP cubes is one of the core design features of the entire BI suite
of products from Microsoft.

Design Approaches
There are two ways to approach the design phase of BI: schema first or data first.
Schema first means that the OLAP architect designs the model based on business requirements,
starting with the careful definition of one or more grain statements. Then empty
star schema structures (fact and dimension tables) are created, and mappings to source data are
drawn up. ETL processes connect the source data to the destination locations in this model.
This method is used for large and complex projects. It’s also used for projects that will load
data that must be cleansed and validated prior to loading. And this method is used when the
BI architect does not have administrative rights on source data systems.

In data-first modeling, source data is shaped for load into OLAP cubes. This is often done
via views against this data. If, for example, source data is stored in an RDBMS, views are writ-
ten in SQL. This approach is used for smaller, simpler implementations, often in shops where
the DBA is the BI architect, so she already has administrative control over the source RDBMS
data. Also, the source data must be in a very clean state—that is, it should have already been
validated, with the error data having been removed.
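For example, in a data-first project you might shape clean OLTP data with a view similar to the following sketch (all table and column names are hypothetical). Note that the source key is reused as-is; no surrogate key is generated.

CREATE VIEW dbo.vDimProduct
AS
SELECT
    p.ProductID AS ProductAlternateKey,  -- source key reused directly
    p.ProductName,
    sc.SubcategoryName,
    c.CategoryName
FROM dbo.Product            AS p
JOIN dbo.ProductSubcategory AS sc ON sc.SubcategoryID = p.SubcategoryID
JOIN dbo.ProductCategory    AS c  ON c.CategoryID     = sc.CategoryID;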

The approach you take depends on a number of factors:

■■ Your skill and experience with OLAP modeling


■■ The number of data sources involved in the project and their complexity
■■ Your access level to the source data

We favor the schema-first approach because we find that it lends itself to cleaner design of a
more standard star-schema type. The resulting OLAP cubes are easier to create and maintain,
and they are generally more usable. Part of this cleanliness is driven by the method itself: you
design an empty set of structures first, then map source data, and then clean and validate the
source data prior to loading it into the destination structures. As
stated, the BIDS tools for SSAS, although flexible, are primarily designed to work with source
data that is as close to a classic OLAP (or star schema) format as possible. Given our prefer-
ence for schema-first design, you might wonder which tools we recommend that you use to
perform this task.

Choosing Tools to Create Your OLAP Model


Our OLAP modeling tool of choice is Visio 2007. We like it for its ease of use and availability,
and we especially like it for its ability to quickly generate the Transact-SQL data definition
language (DDL) source statements. This is important so that the design for OLAP cube source
data can be rapidly materialized on your development server. Although we prefer Visio, it's
also possible to use BIDS itself to create an empty star schema model. We'll detail
the process for this later. If you are already using another modeling tool, such as ERwin, that
will work as well. Just pick a tool that you already know how to use (even if you’ve used it
only for OLTP design) if possible.

You’ll usually start your design by creating dimension tables because much of the dimension
data will be common to multiple grain statements. Common examples of this are time, cus-
tomers, products, and so on. In the example we’ve provided in Figure 5-12, you can see that
there are relatively few tables and that they are highly denormalized (meaning they contain
many columns with redundant data—for example, in StudentDim, the region, area, and big-
Area columns).

Figure 5-12 A model showing single-table source dimensions (star type) and one multitable source
(snowflake type)

Contrast this type of design with OLTP source systems and you’ll begin to understand the
importance of the modeling phase in an OLAP project. In Figure 5-12, each dimension source
table except two (OfferingDim and SurveyDim) is the basis of a single cube dimension.
That is, StudentDim is the basis of the Student dimension, InstructorDim is the basis of the
Instructor dimension, and so on. These are all examples of star dimensions. OfferingDim and
SurveyDim have a primary key-foreign key relationship between the rows. They are the basis
for a snowflake (or multitable sourced) dimension. We’ll talk more about snowflake dimen-
sions later in this chapter. You’ll also notice in Figure 5-12 that each table has two identity (or
key) fields: a newID and an oldID. This is modeled in the preferred method that we discussed
earlier in this chapter.

We’ve also included a diagram (Figure 5-13) of the fact tables for the same project. You can
see that there are nearly as many fact tables as dimension tables in this particular model
example. This is not necessarily common in OLAP model design. More commonly, you'll use
from one to five fact tables with 5 to 15 dimension tables (and sometimes more of both). The rea-
son we show this model is that it illustrates reasons for using multiple fact tables—for exam-
ple, some Session types have facts measured by day, while other Session types have facts
measured by hour. The ability to base a single OLAP cube on multiple fact tables is a valuable

addition to SSAS. It was introduced in SQL Server 2005, yet we see this multiple-fact-table
capability underutilized because of a lack of understanding of star schema structure.

Figure 5-13 A model showing five fact tables to be used as source data for a single SSAS OLAP cube

Other Design Tips


As we mentioned in Chapter 4, “Physical Architecture in Business Intelligence Solutions,”
because your project is now in the developing phase, all documents, including any model
files (such as .vsd models, .sql scripts, and so on), must be under source control if multiple
people will be working on the OLAP models. You can use any source control product or
method your team is comfortable with. The key is to use some type of tool to manage this
process. We prefer Visual Studio Team System Team Foundation Server for source control
and overall project management. We particularly like the new database source control sup-
port, because we generally use SQL Server as a relational source system and as an intermedi-
ate storage location for copies of data after ETL processes have been completed.

As in relational modeling, OLAP modeling is an iterative process. When you first start, you
simply create the skeleton tables for your star schema by providing table names, keys, and
a couple of essential column names (such as first name and last name for customer). Ideally,
these models will be directly traceable to the grain statements you created during earlier
phases of your project. As you continue to work on your design, you’ll refine these models by
adding more detail, such as updating column names, adding data types, and so on.

At this point, we’ll remind you of the importance of using the customer’s exact business ter-
minology when naming objects in your model. The more frequently you can name source
schema tables and columns per the captured taxonomy, the more quickly and easily your
model can be understood, validated, and translated into cubes by everyone working on your
project. We generally use a simple tool, such as an Excel spreadsheet, to document customer
taxonomies. We've found that taking the time to document, validate, and use customer taxono-
mies in BI projects results in a much higher rate of adoption and satisfaction, because
the resulting solution is intuitive to use and carries lower end-user training costs.

Using BIDS as a Designer


To create an empty OLAP cube structure using BIDS, you can use one of two methods. The
first is to create a new cube using the wizard and then select the option of creating a new
cube without using a data source. If using this method, you design all parts of the cube using
the SSAS wizards. We’ll review the SSAS interface, including the wizards, in detail in Chapter 7.

The second method you can use is to build a prototype cube (or a dimension) without using a
data source and to base your object on one or more template files. You can use the included
sample template files to explore this method. The default cube template files are located at
C:\Program Files\Microsoft SQL Server\100\Tools\Templates\olap\1033\Cube Templates. The
default dimension template files are located at C:\Program Files\Microsoft SQL Server\100\
Tools\Templates\olap\1033\Dimension Templates. These files are located in the Program
Files (x86) folder on x64 machines. These files consist of XMLA scripts that are saved with the
appropriate extension—for example, *.dim for dimension files. Later in this book, you’ll learn
how to generate metadata files for cubes and dimensions you design. You can also use files
that you have generated from cubes or dimensions that you have already designed as tem-
plates to design new cubes or dimensions.

For either of these methods, after you complete your design, you materialize it into an RDBMS
by clicking the Database menu in BIDS and then clicking Generate Relational Schema. This
opens a configurable wizard that allows you to generate the Transact-SQL DDL code to cre-
ate the empty OLAP source schema in a defined instance of SQL Server 2008.

Although the preceding methods are interesting and practical for some (simple) design proj-
ects, we still prefer to use Visio for most projects. The reason is that we find Visio to be
more flexible than using SSAS; however, that flexibility comes with a tradeoff. The tradeoff
is that you must design the entire model from scratch in Visio. Visio contains no guidelines,
wizards, or hints to help you model an OLAP cube using a star-schema-like source. Using
BIDS, you can choose to use wizards to generate an iteration of your model. Then you manu-
ally modify that model. We can understand how use of BIDS for this process would facilitate
rapid prototyping. The key factor in deciding which method to use is your depth of knowl-
edge of OLAP concepts—the BIDS method assumes you understand OLAP modeling; Visio,
of course, does not.

Modeling Snowflake Dimensions


As mentioned previously, SSAS has increased the flexibility of source schemas to more easily
accommodate common business needs that aren’t easily modeled using star schemas. This
section discusses some of those new or improved options in SQL Server 2008 SSAS. We’ll
begin with the most common variation on a standard star schema. We discuss it first because
you'll probably have occasion to use it in your model. Quite a few people do use it, and it's also
the variation we most often see implemented inappropriately. It's called a snowflake schema.

Snowflake Schemas
A snowflake is a type of source schema used for dimensional modeling. Simply put, it means
basing an OLAP cube dimension on more than one source (relational) table. The most com-
mon case is to use two source tables. Whenever two or more tables are used as the basis
of a snowflake dimension, there must be a key relationship between the rows in each of the
tables containing the dimension information.

Note in the example in Figure 5-14 that the DimCustomer table has a GeographyKey column in
it. This allows the New Cube Wizard in BIDS to detect the snowflake relationship between the
rows in the DimGeography and DimCustomer tables.
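The relationship in Figure 5-14 can be sketched in Transact-SQL as follows. The column lists are abbreviated and illustrative rather than the exact sample-database definitions.

CREATE TABLE dbo.DimGeography
(
    GeographyKey      int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    City              nvarchar(30) NULL,
    StateProvinceName nvarchar(50) NULL,
    CountryRegionName nvarchar(50) NULL
);

CREATE TABLE dbo.DimCustomer
(
    CustomerKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    -- This column creates the snowflake link to DimGeography.
    GeographyKey int NULL REFERENCES dbo.DimGeography (GeographyKey),
    FirstName    nvarchar(50) NULL,
    LastName     nvarchar(50) NULL
);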

Figure 5-14 A snowflake dimension is based on more than one source table.

The Dimension Usage section of SSAS usually reflects the snowflake relationship you model
when you initially create the cube using the New Cube Wizard (as long as the key columns
have the same names across all related tables). If you need to manually adjust any relation-
ships after the cube has been created, you can do that using tools provided in BIDS.

Figure 5-15 shows the Dimension Usage tab in BIDS, which was shown earlier in the chapter.
This time, we are going to drill down a bit deeper into using it. To adjust or verify any rela-
tionship, click the build button (the small gray square with the three dots on it) on the dimen-
sion name at the intersection of the dimension and fact tables. We'll start by looking at a
Regular (or star) dimension—in the screen shot shown, you reach it through the ProductKey build button.

Figure 5-15 The Dimension Usage tab allows you to establish the granularity of relationships between
source tables.

After you click the build button, you’ll access the Define Relationship dialog box. There you
can confirm that the relationship BIDS detected during the cube build is correct. If the rela-
tionship or the level of granularity has been incorrectly defined, you can adjust it as well. In
Figure 5-16, you can see that the star (or Regular) relationship has been correctly detected—you
validate this by verifying that the correct identifying key columns were detected by BIDS
when the cube was initially created. In this example, using the ProductKey from the Dim
Product (as the primary key) and Fact Internet Sales (as the foreign key) tables reflects the
intent of the OLAP design.

For a snowflake dimension, you review or refine the relationship between the related dimen-
sion tables in the Define Relationship dialog box that you accessed from the Dimension
Usage tab in BIDS. An example is shown in Figure 5-17. Note that the dialog box itself
changes to reflect the modeling needs—that is, you must select the intermediate dimension
table and define the relationship between the two dimension tables by selecting the appro-
priate key columns. You’ll generally leave the Materialize check box selected (the default set-
ting) for snowflake dimensions. This causes the value of the link between the fact table and
the reference dimension for each row to be stored in SSAS. This improves dimension query
performance because the intermediate relationships will be stored on disk rather than calcu-
lated at query time.

Figure 5-16 The most typical relationship is one-to-many between fact and dimension table rows.

Figure 5-17 The referenced relationship involves at least three source tables.

When Should You Use Snowflakes?


Because snowflakes add overhead to cube processing time and query processing time, you
should use them only when the business needs justify their use. The reason they add over-
head is that the data from all involved tables must be joined at the time of processing. This
means the data must be sorted and matched prior to being fetched for loading into the
OLAP dimension. Another source of overhead can occur on each query against members of that
dimension, depending on whether the relationships are materialized (that is, stored on disk as
part of dimension processing) or resolved again at query time.

The most typical business situation that warrants the use of a snowflake dimension design
is one that reduces the size of the dimension table by removing one or more attributes that
are not commonly used and places that attribute (or attributes) in a separate dimension
table. An example of this type of modeling is a customer dimension with an attribute (or
some attributes) that is used for less than 20 percent of the customer records. Another way
to think about this is by considering the predicted percentage of null values for an attribute.
The greater the predicted percentage of nulls, the stronger the case for creating one or more
separate attribute tables related via key columns. Taking the example further, you can
imagine a business requirement to track any existing URL for a customer’s Web site in a busi-
ness scenario where very few of your customers actually have their own Web sites. By creat-
ing a separate but related table, you significantly reduce the size of the customer dimension
table.
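One simple way to test whether an attribute is sparse enough to justify splitting it out is to measure its percentage of null values in the source data. Here is a sketch, assuming a hypothetical WebsiteURL column in a customer source table:

SELECT
    COUNT(*) AS TotalCustomers,
    SUM(CASE WHEN WebsiteURL IS NULL THEN 1 ELSE 0 END) AS CustomersWithNoUrl,
    100.0 * SUM(CASE WHEN WebsiteURL IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS PercentNull
FROM dbo.CustomerSource;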

Another situation that might warrant the use of a snowflake design is one in which the
update behavior of particular dimensional attributes varies—that is, a certain set of dimen-
sional attributes should have their values overwritten if updated values become part of the
source data, whereas a different set should have new records written for each update (main-
taining change history). Although it’s possible to combine different types of update behavior,
depending on the complexity of the dimension, it might be preferable to separate these
attributes into different source tables so that the update mechanisms can be simpler.

Tip In the real world, clients who have extensive experience modeling normalized databases
often overuse snowflakes in OLAP scenarios. Remember that the primary goal of the star schema
is to denormalize the source data for efficiency. Any normalization, such as a snowflake dimen-
sion, should relate directly to business needs. Our experience is that usually less than 15 percent
of dimensions need to be presented as snowflakes.

What Other Cube Design Variations Are Possible?


With SSAS 2008, you can use many other advanced design techniques when building your
OLAP cubes. These include many-to-many dimensions, data mining dimensions, and much
more. Because these advanced techniques are not tied to source schema modeling in the
way that dimension modeling is, we cover these (and other) more advanced modeling
techniques in future chapters.

Why Not Just Use Views Against the Relational Data Sources?
At this point, you might be thinking, “This OLAP modeling seems like a great deal of work.
Why couldn’t I just create views against my OLTP source (or sources) to get the same result?”
Although nothing in BIDS or SSAS prevents you from doing this, our experience has been that
the relational source data is seldom clean enough to model against directly. Also, these designs
seldom perform in an optimal manner.

Microsoft has tried to map out the middle ground regarding design with SQL Server 2008. In
SQL Server 2005 SSAS, Microsoft removed most restrictions that required a strict star source
schema. The results, unfortunately, especially for customers new to BI, were often less than
optimal. Many poorly designed OLAP cubes were built. Although these cubes seemed OK in
development, performance was often completely unacceptable under production loads. New
in SQL Server 2008 is a set of built-in AMO design warnings. These are also called real-time
best practice design warnings. When you attempt to build an OLAP cube in BIDS, these warn-
ings appear if your cube does not comply with a set of known best design practices. These
warnings are guidelines only. You can still build and deploy OLAP cubes of any design. Also,
you can configure (or completely turn off) all these warnings.

Typically, the OLAP model is created and then validated against the grain statements. In the
subsequent steps, source data is mapped to destination locations, and then it’s cleaned. The
validated data is then loaded into the newly created OLAP model via ETL processes.

Although SSAS does not prevent you from building directly against your source data (with-
out modeling, cleaning, and loading it into a new model), the reality that we've seen is that
most organizations' data simply isn't prepared to allow for a direct OLAP query against OLTP
source data.

One area where relational views (rather than copying data) are sometimes used in OLAP
projects is as data sources for the ETL. That is, in environments where OLAP modelers and ETL
engineers are not allowed direct access to data sources, it's common for them to access the
various data sources via views created by DBAs. Also, the time involved to write the queries
to be used in (and possibly to index) the relational views can be quite substantial.

More About Dimensional Modeling


The Unified Dimensional Model (UDM) is one of the recent key enhancements to SSAS. It
was introduced in SSAS 2005; however, in our experience, it was not correctly understood or
implemented by most customers. A big reason for this was improper modeling. For
that reason, we’ll spend some time drilling down into details regarding this topic.

To start, you need to understand how dimensional data is best presented in OLAP cubes. As
we’ve seen, dimension source data is best modeled in a very denormalized (or deliberately
duplicated) way—most often, in a single table per entity. An example is one table with all
product attribute information, such as product name, product size, product package color,
product subcategory, product category, and so on. So the question is, how is that duplicated
information (in this case, for product subcategories and categories) best displayed in a cube?

The answer is in a hierarchy or rollup. The easiest way to understand this is to visualize it.
Looking at the following example, you can see the rollup levels in the AdventureWorks-
DW2008 cube product dimension sample. Note in Figure 5-18 that individual products
roll up to higher levels—in this case, subcategories and categories. Modeling for hierarchy

building inside of OLAP cubes is a core concept in dimensional source data. Fortunately, SSAS
includes lots of flexibility in the actual building of one or more hierarchies of this source data.
So, once again, the key to easy implementation is appropriate source data modeling.

Figure 5-18 Dimension members are often grouped into hierarchies.

It will also be helpful for you to understand some terminology used in BIDS. Dimension
structure refers to the level (or rollup group) names—in this case, Product Names, Product
Subcategories, and Product Categories. Level names roughly correspond to source dimension
table column names. Attribute relationships are defined in a tab, the Dimension Structure tab,
that is new in SQL Server 2008 BIDS. This tab was added because so many customers defined
these relationships incorrectly in previous versions of the product. Attribute relationships
establish relationships between data in source columns from one or more dimension source
tables. We’ll be taking a closer look at the hows and whys of defining attribute relationships
in greater detail later in this chapter.

In Figure 5-19, you can see the Dimension Structure tab in the dimension designer in BIDS.
There you can see all columns from the source tables in the Data Source View section. In the
Attributes section, you can see all attributes that have been mapped to the dimension. In the
Hierarchies section in the center of the screen, you can see all defined hierarchies.

Figure 5-19 The Dimension Structure tab allows you to edit dimension information.

The most important initial OLAP dimension modeling consideration is to make every attempt
to denormalize all source data related to a particular entity, which means that each dimen-
sion’s source data is put into a single table. Typically, these tables are very wide—that is, they
have many columns—and are not especially deep—that is, they don’t have many rows.

An example of a denormalized source structure is a product dimension. Your company might
sell only a couple of types of products; however, you might retain many attributes about each
product—for example, package size, package color, introduction date, and so on.

There can be exceptions to the general “wide but not deep” modeling rule. A common
exception is the structure of the table for the customers dimension. If it’s a business require-
ment to capture all customers for all time, and if your organization services a huge customer
base, your customer dimension source table might have millions of rows.

One significant architectural enhancement in Analysis Services 2008 is noteworthy here. That
is, in this version, only dimension members requested by a client tool are loaded into mem-
ory. This behavior is different from SSAS 2000, where upon startup all members of all dimen-
sions were loaded into memory. This enhancement allows you to be inclusive in the design of
dimensions—that is, you can assume that more (attributes) is usually better.

After you’ve created the appropriate source dimension table or tables and populated them
with data, SSAS retrieves information from these tables during cube and dimension pro-
cessing. Then SSAS uses a SELECT DISTINCT command to retrieve members from each
column. When using the Cube Wizard in SSAS 2005, SSAS attempted to locate natural hier-
archies in the data. If natural hierarchies were found, BIDS would add them to the dimen-
sions. This feature has been removed in SSAS 2008 because building optimized hierarchies
requires additional, deliberate design steps. We'll discuss this in detail in Chapter 6,
“Understanding SSAS in SSMS and SQL Server Profiler.” Also new to SSAS 2008 is a tab in the
dimension designer, the Attribute Relationships tab, which helps you to better visualize the
attribute relationships in dimensional hierarchies, as shown in Figure 5-20.

Figure 5-20 The Attribute Relationships tab allows you to visualize dimensional hierarchies.
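Conceptually, the member retrieval described above resembles the following queries against a denormalized product dimension table that has ProductName, SubcategoryName, and CategoryName columns (these names are assumed for illustration; the exact SQL that SSAS generates will differ):

SELECT DISTINCT CategoryName    FROM dbo.DimProduct;  -- members of the Category level
SELECT DISTINCT SubcategoryName FROM dbo.DimProduct;  -- members of the Subcategory level
SELECT DISTINCT ProductName     FROM dbo.DimProduct;  -- leaf-level product members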

There are a couple of other features in SSAS dimensions you should consider when you are
in the modeling phase of your BI project. These features include the ability to set the default
displayed member for each attribute in a dimension, the ability to convert nulls to a value—
usually, Unknown or 0 (zero)—and the ability to allow duplicate names to be displayed. You

should model for these features based on business requirements, so you should capture the
business requirements for defaults, unknowns, and duplicates for each dimension and all of
its attributes during the modeling phase of your project.
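If you decide to resolve nulls during the ETL load rather than (or in addition to) relying on the dimension's unknown-member settings in SSAS, the translation can be as simple as the following sketch (the staging table and all column names are hypothetical):

INSERT INTO dbo.DimCustomer (CustomerAlternateKey, FirstName, LastName, City)
SELECT
    s.CustomerID,
    ISNULL(s.FirstName, N'Unknown'),
    ISNULL(s.LastName,  N'Unknown'),
    ISNULL(s.City,      N'Unknown')
FROM staging.Customer AS s;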

When we cover dimension and cube creation later in this book, we’ll take a more detailed
look at all of these considerations—at this point, we are just covering topics that relate to
modeling data. In that vein, there is one remaining important topic—the rate of change of
dimensional information. It’s expected that fact table information will change over time.
By this we mean more rows are normally added to the fact tables. In the case of dimension
information, you must understand the business requirements that relate to new dimension
data and model correctly for them. Understanding slowly changing dimensions is key, so we’ll
take on this topic next.

Slowly Changing Dimensions


To understand slowly changing dimensions (SCD), you must understand what constitutes a
change to dimensional data. What you are looking for, or asking, is what the desired out-
come is when dimension member information is updated or deleted. In OLAP modeling,
adding (inserting) new rows into dimension tables is not considered a change. The only two
cases you must be concerned with for modeling are situations where you must account for
updates and deletes to rows in dimension tables.

The first question to ask of your subject matter experts is this one: “What needs to happen
when rows in the dimension source tables no longer have fact data associated with them?”
In our experience, most clients prefer to have dimension members marked as Not Active at
this point, rather than as Deleted. In some cases, it has been a business requirement to add
the date of deactivation as well. A business example of this is a customer who returned a pur-
chase and has effectively made no purchases.

The next case to consider is the business requirement for updates to dimension table row
values. A common business scenario element is names (of customers, employees, and so
on). The question to ask here is this one: "What needs to happen when an employee changes
her name (for example, when she gets married)?"

There are four different possibilities in modeling, depending on the answer to the preceding
question. You should verify the business requirements for how requested changes should be
processed for each level of dimension data. For example, when using a Geography dimen-
sion, verify the countries, states, counties, and cities data. Here are the possibilities:

■■ No changes. Any change is an error. Always retain original information.


■■ Last change wins. Overwrite. Do not retain original information.
■■ Retain some history. Retain a fixed number of previous values.
■■ Retain all history. Retain all previous values.

Using the Geography dimension information example, typical requirements are that no
changes will be allowed for country, state, and county. However, cities might change names,
so “Retain all history” could be the requirement. You should plan to document the change
requirements for each level in each dimension during the design stage of your project.

Because dimension change behavior is so important to appropriate modeling and OLAP cube
building, dimension change behavior types have actually been assigned numbers for the dif-
ferent possible behaviors. You’ll see this in the SSIS interface when you are building packages
to maintain the dimension source tables.

Types of Slowly Changing Dimensions


The requirements for dimension row data change processing (for example, overwrite, no
changes allowed, and so on) are the basis for modeling the source tables using the standard
slowly changing dimension (SCD) types. SCD modeling involves a couple of
things. First, to accommodate some of these behaviors, you'll need to include additional
columns in your source dimension tables. Second, you'll see the type names and their associated
numbers listed in the Slowly Changing Dimension data flow component in the SSIS package
designer. This data flow component contains a Slowly Changing Dimension Wizard, which
is quite easy to use if you’ve modeled the source data with the correct number and type of
columns. Although initially you’ll probably manually populate your OLAP dimensions from
source tables, as you move from development into production, you’ll want to automate this
process. Using SSIS packages to do this is a natural fit.

After you have reviewed the requirements and noted that some dimension rows will allow
changes, you should translate those requirements to one of the solution types in the follow-
ing list. We’ll use an example of a customer name attribute when describing each of these
various ways of modeling dimensional attributes so that you can better understand the con-
trasts between the standard ways of modeling dimensional attributes.

■■ Changing Attribute Type 1 means overwriting previous dimension member values,
which is sometimes also called "last change wins." This type is called a
Changing Attribute in the SSIS Slowly Changing Dimension Wizard. You do not need
to make any change to source dimension tables for this requirement. You might want
to include a column to record date of entry, however. An example of this would be a
customer name where you needed only to see the most current name, such as a cus-
tomer’s married name rather than her maiden name.
■■ Historical Attribute Type 2 means adding a single new record (or row value) when the
dimension member value changes. This type is called a Historical Attribute in the SSIS
Slowly Changing Dimension Wizard. You add at least one new column in your source
dimension table to hold this information—that is, the previous name. You might also
be required to track the date of the name change and add another column to store the
start date of a new name (see the sketch that follows this list).

■■ Add Multiple Attributes Type 3 means adding multiple attributes (or column values)
when the dimension member value changes. Sometimes there is a fixed requirement.
For example, when we worked with a government welfare agency, it was required by
law to keep 14 surnames. At other times, the requirement was to keep all possible
values of a surname. In general, Type 3 is not supported in the SSIS Slowly Changing
Dimension Wizard. This doesn’t mean, of course, that you shouldn’t use it in your
dimensional modeling. It just means that you’ll have to do a bit more work to build SSIS
dimension update packages. If maintaining all history on an attribute such as a surname
is your requirement, you need to establish how many values you need to retain and
build either columns or a separate (related) table to hold this information. In this example,
you'd add three columns at minimum: a CurrentFlag, an EffectiveStart date, and an
EffectiveEnd date. This type of modeling can become quite complex, and it should be
avoided if possible.
■■ Fixed This type means that no changes are allowed. Any requested change is treated
as an error and throws a run-time exception when the SSIS package runs. No particular
change to source data is required.
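To make the Type 1 and Type 2 behaviors described in this list more concrete, here is a hedged Transact-SQL sketch: a dimension table carrying the extra Type 2 columns, followed by a hand-written expire-and-insert pass. The SSIS Slowly Changing Dimension Wizard generates equivalent logic for you, and all table and column names here are illustrative.

-- Dimension table with columns that support Type 2 (historical) changes.
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey          int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    EmployeeAlternateKey int           NOT NULL,                  -- source key
    FullName             nvarchar(100) NOT NULL,                  -- Type 2 attribute
    CurrentFlag          bit           NOT NULL DEFAULT (1),
    EffectiveStart       date          NOT NULL,
    EffectiveEnd         date          NULL
);

-- Step 1: expire the current row for any employee whose name changed in the source.
UPDATE d
SET    d.CurrentFlag  = 0,
       d.EffectiveEnd = GETDATE()
FROM   dbo.DimEmployee AS d
JOIN   staging.Employee AS s ON s.EmployeeID = d.EmployeeAlternateKey
WHERE  d.CurrentFlag = 1
  AND  d.FullName <> s.FullName;

-- Step 2: insert a fresh current row for every source employee that no longer
-- has one (this covers both changed names and brand-new employees).
INSERT INTO dbo.DimEmployee (EmployeeAlternateKey, FullName, CurrentFlag, EffectiveStart)
SELECT s.EmployeeID, s.FullName, 1, GETDATE()
FROM   staging.Employee AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.DimEmployee AS d
                   WHERE  d.EmployeeAlternateKey = s.EmployeeID
                     AND  d.CurrentFlag = 1);

A Type 1 (overwrite) attribute, by contrast, would simply be handled with an UPDATE that replaces the old value in the current row.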

Figure 5-21 shows the Slowly Changing Dimension Wizard that runs when you configure the
Slowly Changing Dimension data flow component when building an SSIS package in BIDS.
Note that you set the change behavior for each column when you configure this wizard. Also
note that, although on the next page of this wizard the default is Fail The Transformation
If Changes Are Detected In A Fixed Attribute, you can turn this default off. An interesting
option in the SSIS SCD task is the ability for you to cascade changes to multiple, related
attributes (which is turned off by default), by selecting the Change All The Matching Records
When Changes Are Detected In A Changing Attribute option. This last option is available to
support highly complex modeling scenarios.

Rapidly Changing Dimensions


In rapidly changing dimensions, the member values change constantly. An example of this is
employee information for a fast-food restaurant chain, where staff turnover is very high. This
type of data should be a very small subset of your dimensional data. To work with this type
of dimension, you’ll probably vary the storage location, rather than implementing any par-
ticular design in the OLAP model itself. Rapidly changing dimensional data storage models
are covered in more detail in Chapter 9, “Processing Cubes and Dimensions.”

Figure 5-21 The Slowly Changing Dimension Wizard in SSIS speeds maintenance package creation for SCDs.

Writeback
Another advanced capability of dimensions is writeback. Writeback is the ability for autho-
rized users to directly update the data members in a dimension using their connected client
tools. These client tools would also have to support writeback, and not all client tools have
this capability. So, if dimension member writeback is a business requirement, you must also
verify that any client tool you intend to make available for this purpose actually supports it.

Dimension writeback changes can include any combination of inserts, updates, or deletes.
You can configure which types of changes are permitted via writeback as well. In our experi-
ence, only a very small number of business scenarios warrant enabling writeback for particu-
lar cube dimensions. If you are considering enabling writeback, do verify that it’s acceptable
given any regulatory requirements (SOX, HIPAA, and so on) in your particular business
environment.

There are some restrictions to consider if you want to enable writeback. The first restriction
applies to the modeling of the dimension data and that is why we are including this topic
here. The restriction is that the dimension must be based on a single source table (meaning it
must use a star schema modeling format). The second restriction is that writeback dimensions
are supported only in the Enterprise edition of Analysis Services (which is available only as
part of the Enterprise edition of SQL Server 2008). Finally, writeback security must be specifi-
cally enabled at the user level. We’ll cover this in more detail in Chapter 9.

Tip To review features available by edition for SSAS 2008, go to http://download.microsoft.com/download/2/d/f/2df66c0c-fff2-4f2e-b739-bf4581cee533/SQLServer%202008CompareEnterpriseStandard.pdf.

Understanding Fact (Measure) Modeling


A core part of your BI solution is the set of business facts you choose to include in your OLAP
cubes. As mentioned previously, source facts become measures after they are loaded into
OLAP cubes. Measures are the key metrics by which you measure the success of your busi-
ness. Some examples include daily sales amount, product sales quantity, net profit, and so on.

It should be clear by now that the selection of the appropriate facts is a critical consideration
in your model. As discussed previously, the creation of validated grain statements is the foun-
dation of appropriate modeling of fact tables. The tricky part of modeling fact source tables
is twofold. First, this type of “key plus fact” structure doesn’t usually exist in OLTP source data,
and second, source data often contains invalid data. For this reason most of our custom-
ers choose to use ETL processes to validate, clean, combine, and then load source data into
materialized star schema source fact tables as a basis for loading OLAP cubes. We’ve worked
with more than one customer who wanted to skip this step, saying, “The source data is quite
clean and ready to go.” Upon investigation the customer found that this wasn’t the case. It’s
theoretically possible to use relational views directly against OLTP source data to create vir-
tual fact tables; however, we rarely implement this approach for the reasons
listed previously.

Another consideration is timing. We purposely discussed dimension table modeling before
fact table modeling because of the key generation sequence. If you follow classic dimen-
sional modeling, you’ll generate new and unique keys for each dimension table row when
you populate it from source data. In this case, you’ll next use these (primary) keys from the
dimension source tables as (foreign) keys in the fact table. Obviously, the dimension tables
must load successfully prior to loading the fact table in this scenario.

In addition to (foreign) keys, the fact tables will also contain fact data. In most cases, these
fact data columns use numeric data types. Also, this data will most often be
aggregated by summing the facts across all levels of all dimensions. There are, however,
exceptions to this rule. A typical case is a cube that captures sales activity. If you take a look
at the measure aggregates in the AdventureWorksDW2008 sample, you'll see that most mea-
sures simply use the SUM aggregate, but the Internet Transaction Count measure, for example,
uses the COUNT aggregate.

BIDS also includes the ability to group similar measures together. Measure groups are shown
in BIDS using folders that have the same name as the source fact tables. In fact, it is common
that the contents of one measure group are identical to one source fact table. These fact
tables originate from source star schemas. You can also see the data type and aggregation

type for each measure in BIDS. The ability to combine multiple fact tables into a single OLAP
cube was introduced in SSAS 2005 and is a powerful feature for simplifying views of large
amounts of business data. We’ll talk about the implementation of this feature in more detail
in Chapter 7.

The built-in aggregation types available for use in SSAS are listed next. When you are model-
ing fact tables, you should determine which type of aggregation behavior meets your busi-
ness requirements. Be aware that you are not limited to the built-in aggregation types. You
can also create any number of custom aggregations via MDX statements. We’ll be explaining
how to do this in Chapter 11, “Advanced MDX.”

Note in Table 5-1 that the list of built-in aggregate functions contains type information for
each aggregate function. This is a descriptor of the aggregation behavior. Additive means the
values roll up to one ultimate total. Semi-additive means the values roll up to a total for one or
more particular, designated levels, but not to a cumulative total for all levels. An example of a
common use of semi-additive behavior is in the time dimension. It is a common requirement to
roll up to the Year level, but not to the All Time level; that is, to be able to see measures sum-
marized by a particular year but not rolled up across all years to a grand total. Non-additive
means the value does not roll up at all—that is, the value displayed in the OLAP cube is only
that particular (original) value. Also, semi-additive measures require the Enterprise edition of SSAS.

TABLE 5-1 List of Built-in Aggregations and Type Information

Aggregation             Type
Sum                     Additive
Count                   Additive
Min, Max                Semi-additive
FirstChild, LastChild   Semi-additive
AverageOfChildren       Semi-additive
First(Last)NonEmpty     Semi-additive
ByAccount               Semi-additive
Distinct Count          Non-additive

Note ByAccount aggregation is a type of aggregation that calculates according to the


aggregation function assigned to the account type for a member in an account dimension. An
account dimension is simply a dimension that is derived from a single, relational table with
an account column. The data value in this column is used by SSAS to map the types of accounts
to well-known account types (for example Assets, Balances, and so on) so that you replicate the
functionality of a balance sheet in your cube. SSAS uses these mappings to apply the appropriate
aggregation functions to the accounts. If no account type dimension exists in the measure group,
ByAccount is treated as the None aggregation function. This is typically used if a portion of your
cube is being used as a balance sheet.

Calculated vs. Derived Measures


A final consideration for measures is that you can elect to derive measure values when load-
ing data into the cube from source fact tables. This type of measure is called a derived mea-
sure because it’s “derived” or created when the cube is loaded rather than simply retrieved
using a SELECT statement from the source fact tables. Creating derived measures is done via
a statement (Transact-SQL for SQL Server) that is understood by the source database. We
do not advocate using derived measures because the overhead of creating them slows cube
processing times.

Rather than incurring the overhead of deriving measures during cube loads, an alterna-
tive way to create the measure value is to calculate and store the measure value during the
ETL process that is used to load the (relational) fact table rather than the SSAS cube. This
approach assumes you’ve chosen to materialize the star schema source data. By materialize,
we mean that you have made a physical copy of your source data on some intermediate stor-
age location, usually SQL Server. This is opposed to simply creating a logical representation
(or a view) of source data. That way, the value can simply be retrieved (rather than calculated)
during the cube load process.
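As a sketch of that alternative, the calculation can be performed once while the relational fact table is loaded, so the cube simply reads a stored column. All names, and the GrossProfit calculation itself, are illustrative assumptions rather than part of any particular sample schema.

INSERT INTO dbo.FactSales (DateKey, ProductKey, SalesAmount, TotalProductCost, GrossProfit)
SELECT
    s.DateKey,
    s.ProductKey,
    s.SalesAmount,
    s.TotalProductCost,
    s.SalesAmount - s.TotalProductCost AS GrossProfit  -- computed and stored during ETL
FROM staging.Sales AS s;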

In addition to derived measures, SSAS supports calculated measures. Calculated measure val-
ues are calculated at query time by SSAS. Calculated measures are defined by expressions that
you write against the OLAP cube data. These expressions are written in the language required
for querying SSAS cubes, which is MDX. If you opt for this approach, no particular modeling
changes are needed. We’ll review the process for creating calculated measures in Chapter 8.
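
For a preview of what Chapter 8 covers, a cube-level calculated measure is defined with an MDX
expression. The following is only a sketch (the measure names come from the AdventureWorks
sample cube, and in BIDS you would add the definition on the cube's Calculations tab):

CREATE MEMBER CURRENTCUBE.[Measures].[Gross Profit]
    AS [Measures].[Sales Amount] - [Measures].[Total Product Cost],
    FORMAT_STRING = 'Currency',
    VISIBLE = 1;

Because the expression is evaluated at query time, it adds nothing to cube processing; the
trade-off is a small amount of work each time the measure is queried.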

Other Considerations in BI Modeling


SSAS 2008 supports additional capabilities that might affect the final modeling of your cube
source schemas. In our experience, the majority of you will likely begin your modeling and
design process by using the concepts presented in this chapter. You’ll then load some sample
data into prototype OLAP cubes to validate both the data and the modeling concepts. You’ll
then iterate, refining the cube design by implementing some of the more advanced capabili-
ties of SSAS and by continuing to add data that has been validated and cleansed.

Data Mining
Data mining capabilities are greatly enhanced in SSAS 2008 compared to what was available
in earlier versions of SSAS. There are now more sophisticated algorithms, these algorithms
have been optimized, and client tools, such as Excel, have been enhanced to support these
improvements. A quick definition of data mining is the ability to use the included algorithms
to detect patterns in your data. For this reason, data mining technologies are sometimes also
called predictive analytics. Interestingly, SSAS's data mining capabilities can be used
with either OLTP or OLAP source data. We’ll cover data mining modeling and implementation
in greater detail in Chapters 12 and 13.

Key Performance Indicators (KPIs)


The ability to create key performance indicators from inside SSAS 2008 cubes is a much-
requested feature. A simple definition of a KPI is a visual representation (usually displayed in
an end-user tool) of one or more key business metrics. For each metric (such as daily sales),
the current state or value, a comparison to an overall goal, a trend over time (positive,
neutral, or negative), and other information can be shown. KPIs are usually displayed via
graphics: red, yellow, or green traffic lights; up arrows or down arrows; and so on. KPIs are
often part of dashboards or scorecards in client interfaces. SSAS OLAP cubes include built-in
tools to facilitate the quick and easy creation of KPIs. We'll discuss the planning and
implementation of SSAS KPIs in Chapter 8.
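
To give you a sense of what lies behind a KPI, the value, goal, status, and trend are each
defined as MDX expressions on the cube. A status expression conventionally returns -1, 0, or 1,
which client tools render as red, yellow, or green indicators. The following is only a sketch,
and the Daily Sales KPI name is hypothetical:

Case
    When KpiValue("Daily Sales") / KpiGoal("Daily Sales") >= 1    Then 1   // at or above goal
    When KpiValue("Daily Sales") / KpiGoal("Daily Sales") >= 0.85 Then 0   // within 15 percent of goal
    Else -1                                                                // well below goal
End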

Actions, Perspectives, and Translations


SSAS actions give end users the ability to right-click a cell of the cube (using client tools that
support SSAS actions) and to perform some type of defined action, such as passing the value
of the selected cell into an external application as a parameter value and then launching
that application. They are not new to this release; however, there are new types of actions
available.

Perspectives are similar to relational views. They allow you to create named subsets of your
cube data for the convenience of your end users. They require the Enterprise edition of SSAS
2008. Translations give you a quick and easy way to present localized cube metadata to end
users. All of these capabilities will be covered in more detail in Chapter 8.

Source Control and Other Documentation Standards


By the time your BI project reaches the OLAP modeling phase, it will contain many files of many
different types. While you are in the modeling phase, these files will consist mostly of Visio
diagrams, Excel spreadsheets, and Word documents. It's important to establish a methodology for
versioning and source control early in your project. When you move to the prototyping and
development phase, the number and types of files will increase exponentially.

You can use any tool that works for you and your team. Some possible choices include Visual
SourceSafe, Visual Studio Team System, SharePoint Document Libraries, or versioning via
Office. The important point is that you must establish a system that all of your BI team mem-
bers are committed to using early in your BI project life cycle. Also, it's important to use the
right tool for the right job—for example, SharePoint Document Libraries are designed to sup-
port versioning of requirements documents (which are typically written using Word, Excel,
and so on), while Visual SourceSafe is designed to support source control for OLAP code
files, which you'll create later in your project's life cycle.

Another important consideration is naming conventions. Unlike the world of OLTP (or relational)
database design, the world of OLAP design has few common naming standards. We suggest that you
author, publish, and distribute written naming guidelines to all members of your BI team during
the requirements-gathering phase of your project. At a minimum, these naming guidelines should
include suggested formats for cubes, dimensions, levels, attributes, star schema fact and
dimension tables, SSIS packages, SSRS reports, and Office SharePoint Server 2007 (SPS) pages
and dashboards.

Summary
In this chapter, we covered the basic modeling concepts and techniques for OLAP cubes in a
BI project. We discussed the idea of using grain statements for a high-level validation of your
modeling work. You learned how best to determine what types of dimensions (fixed, slowly
changing, or rapidly changing) and facts (stored, calculated, or derived) will be the basis for
your cubes. We also discussed the concept of hierarchies of dimensional information.

If you are new to BI and are thinking that you've got some unlearning to do, you are not
alone. We hear this comment quite frequently from our clients. OLAP modeling is not at all like
OLTP modeling, mostly because of the pervasive concept in OLAP of deliberate denormalization.
In the next two chapters, we'll show you tools and techniques to move your design from idea to
reality: we'll explore the SSAS tool set and then dive into the SSAS interface in BIDS to get
you started building your first OLAP cube.
Part II
Microsoft SQL Server 2008
Analysis Services for Developers

Chapter 6
Understanding SSAS in SSMS and
SQL Server Profiler
We covered quite a bit of ground in the preceding five chapters: everything from the key
concepts of business intelligence to its languages, processes, and modeling methodologies.
By now, we're sure you're quite ready to roll up your sleeves and get to work in
the Business Intelligence Development Studio (BIDS) interface. Before you start, though, we
have one more chapter’s worth of information. In this chapter, we’ll explore the ins and outs
of some tools you’ll use when working with Microsoft SQL Server Analysis Services (SSAS)
objects. After reading this chapter, you’ll be an expert in not only SQL Server Profiler but also
SQL Server Management Studio (SSMS) and other tools that will make your work in SSAS very
productive. We aim to provide valuable information in this chapter for all SSAS developers—
from those of you who are new to the tools to those who have some experience. If you’re
wondering when we’re going to get around to discussing BIDS, we’ll start that in Chapter 7,
“Designing OLAP Cubes Using BIDS.”

Core Tools in SQL Server Analysis Services


We’ll begin this discussion of the tools you’ll use to design, create, populate, secure, and
manage OLAP cubes by taking an inventory. That is, we’ll first list all the tools that are part of
SQL Server 2008. Later in this chapter, we’ll look at useful utilities and other tools you can get
for free or at low cost that are not included with SQL Server 2008. We mention these because
we’ve found them to be useful for production work. Before we list our inventory, we’ll talk a
bit about the target audience—that is, who Microsoft thinks will use these tools. We share
this information so that you can choose the tools that best fit your style, background, and
expectations.

SQL Server 2008 does not install SSAS by default. When you install SSAS, several tools are
installed with the SSAS engine and data storage mechanisms. Also, an SSAS installation
does not require that you install SQL Server Database Engine Services. You’ll probably want
to install SQL Server Database Engine Services, however, because some of the tools that
install with it are useful with SSAS cubes. SQL Server 2008 installation follows the minimum-
installation paradigm, so you’ll probably want to verify which components you’ve installed
before exploring the tools for SSAS. To come up with this inventory list, follow these steps:

1. Run Setup.exe from the installation media.


2. On the left side of the Installation Center screen, click Installation and then select New
SQL Server Stand-Alone Installation Or Add Features To An Existing Installation on the
resulting screen.
3. Click OK after the system checks complete, and click the Install button on the next
screen.
4. Click Next on the resulting screen. Then select the Add Features To An Existing Instance
Of SQL Server 2008 option, and select the appropriate instance of SQL Server from the
list. After you select the particular instance you want to verify, click Next to review the
features that have been installed for this instance.

Who Is an SSAS Developer?


Understanding the answer to this question will help you to understand the broad vari-
ety of tools available with SSAS OLAP cubes in SQL Server 2008. Unlike classic .NET
developers, SSAS developers are people with a broad and varied skill set. Microsoft has
provided some tools for developers who are comfortable writing code and other tools
for developers who are not. In fact, the bias in SSAS development favors those who
prefer to create objects by using a graphical user interface rather than by writing code.
We mention this specifically because we’ve seen traditional application developers, who
are looking for a code-first approach, become frustrated with the developer interface
Microsoft has provided.

In addition to providing a rich graphical user interface in most of the tools, Microsoft
has also included wizards to further expedite the most commonly performed tasks. In
our experience, application developers who are familiar with a code-first environment
often fail to take the time to understand and explore the development environments
available for SSAS. This results in frustration on the part of the developers and lost pro-
ductivity on business intelligence (BI) projects.

We’ll start by presenting information at a level that assumes you’ve never worked with
any version of Microsoft Visual Studio or Enterprise Manager (or Management Studio)
before. Even if you have experience in one or both of these environments, you might
still want to read this section. Our goal is to maximize your productivity by sharing our
tips, best practices, and lessons learned.

Note in Figure 6-1 that some components are shared across the SQL Server 2008 installation,
but others install only when a particular component is selected. As mentioned in previ-
ous chapters, SQL Server 2008 no longer ships with sample databases. If you want to install
the AdventureWorks OLTP and OLAP samples, you must download them from CodePlex.
For instructions on where to locate these samples and how to install them, see Chapter 1,
“Business Intelligence Basics.”

Figure 6-1 Installed features are shown on the Select Features page.

After you’ve verified that everything you expected to be installed is actually installed in the
particular instance of SSAS, you’re ready to start working with the tools. A variety of tools are
included; however, you’ll do the majority of your design and development work in just one
tool—BIDS.

For completeness, this is the list of tools installed with the various SQL Server components:

■■ Import/Export Wizard Used to import/export data and to perform simple transformations
■■ Business Intelligence Development Studio Primary SSAS, SSIS, and SSRS develop-
ment environment
■■ SQL Server Management Studio Primary SQL Server (all components) administrative
environment
■■ SSAS Deployment Wizard Used to deploy SSAS metadata (*.asdatabase) from one
server to another
■■ SSRS Configuration Manager Used to configure SSRS
■■ SQL Server Configuration Manager Used to configure SQL Server components,
including SSAS
■■ SQL Server Error and Usage Reporting Used to configure error/usage reporting—
that is, to specify whether or not to send a report to Microsoft
■■ SQL Server Installation Center New information center, as shown in Figure 6-2, which
includes hardware and software requirements, baseline security, and installed features
and samples
■■ SQL Server Books Online Product documentation
■■ Database Engine Tuning Advisor Used to provide performance tuning recommenda-
tions for SQL Server databases
■■ SQL Server Profiler Used to capture activity running on SQL Server components,
including SSAS
■■ Report Builder 2.0 Used by nondevelopers to design SSRS reports. This is available as
a separate download and is not on the SQL Server installation media.

Figure 6-2 The SQL Server Installation Center

A number of GUI tools are available with a complete SQL Server 2008 installation. By com-
plete, we mean that all components of SQL Server 2008 are installed on the machine. A full
installation is not required, or even recommended, for production environments. The best
practice for production environments is to create a reduced attack surface by installing only
the components and tools needed to satisfy the business requirements. In addition, you
should secure access to powerful tools with appropriate security measures. We’ll talk more
about security in general later in this chapter, and we’ll describe best practices for locking
down tools in production environments. For now, we’ll stay in an exploratory mode by install-
ing everything and accessing it with administrator privileges to understand the capabilities of
the various tools.

Because BIDS looks like the Visual Studio interface, people often ask us if an SSAS instance
installation requires a full Visual Studio install. The answer is no. If Visual Studio is not
installed on a machine with SSAS, BIDS, which is a subset of Visual Studio, installs. If Visual
Studio is installed, the BIDS project templates install inside of the Visual Studio instance on
that machine.

The core tools you’ll use for development of OLAP cubes and data mining structures in SSAS
are BIDS, SSMS, and SQL Server Profiler. Before we take a closer look at these GUI tools, we’ll
mention a couple of command-line tools that are available to you as well.

In addition to the GUI tools, several command-line tools are installed when you install SQL
Server 2008 SSAS. You can also download additional free tools from CodePlex. One tool
available on the CodePlex site is called BIDS Helper, which you can find at
http://www.codeplex.com/bidshelper. It includes many useful features for SSAS development. You can find
other useful tools on CodePlex as well. We’ll list only a couple of the tools that we’ve used in
our projects:

■■ ascmd.exe Allows you to run XMLA, MDX, or DMX scripts from the command prompt
(available at http://www.codeplex.com/MSFTASProdSamples)
■■ SQLPS.exe Allows you to execute Transact-SQL via the Windows PowerShell command
line—mostly used when managing SQL Server source data for BI projects.

As we mentioned, you’ll also want to continue to monitor CodePlex for new community-
driven tools and samples. Contributors to CodePlex include both Microsoft employees and
non-Microsoft contributors.

Baseline Service Configuration


Now that we’ve seen the list of tools, we’ll take a look at the configuration of SSAS. The sim-
plest way to do this is to use SQL Server Configuration Manager. In Figure 6-3, you can see
that on our demo machine, the installed instance of SSAS is named MSSQLSERVER and its
current state is set to Running. You can also see that the Start Mode is set to Automatic. The
service log on account is set to NT AUTHORITY\Network Service. Of course, your settings
may vary from our defaults.

Although you can also see this information using the Control Panel Services item, it’s recom-
mended that you view and change any of this information using the SQL Server Configura-
tion Manager. The reason for this is that the latter tool properly changes associated registry
settings when changes to the service configuration are made. This association is not neces-
sarily observed if configuration changes are made using the Control Panel Services item.

Figure 6-3 SQL Server Configuration Manager

The most important setting for the SSAS service itself is the Log On (or service) account. You
have two choices for this setting: you can select one of three built-in accounts (Local System,
Local Service, or Network Service), or you can use an account that has been created specifically
for this purpose, either locally or on your domain. Figure 6-4 shows the dialog box in SQL
Server Configuration Manager where you set this. Which one of these choices is best, and why?

Our answer depends on which environment you’re working in. If you’re exploring or setting
up a development machine in an isolated domain, or as a stand-alone server, you can use any
account. As we show in Figures 6-3 and 6-4, we usually just use a local account that has been
added to the local administrator’s group for this purpose. We do remind you that this circum-
vention of security is appropriate only for nonproduction environments, however.

Figure 6-4 The SSAS service account uses a Log On account.

SQL Server Books Online contains lots of information about log-on accounts. You’ll want
to review the topics “Setting Up Windows Service Accounts” and “Choosing the Service
Account” for details on exactly which permissions and rights are needed for your particular
service (user) account.

We’ll distill the SQL Server Books Online information down a bit because, in practice, we’ve
seen only two configurations. Most typically, we see our clients use either a local or domain
lower-privileged (similar to a regular user) account. Be aware that for SSAS-only installations,
the ability to use a domain user account as the SSAS logon account is disabled. One impor-
tant consideration specific to SSAS is that the service logon account information is used to
encrypt SSAS connection strings and passwords. This is a further reason to use an isolated,
monitored, unique, low-privileged account.

Service Principal Names


What is a Service Principal Name (SPN)? An SPN uniquely identifies an instance of a service for
Kerberos authentication. When you associate a service account with SSAS at the time of
installation, an SPN is registered for it. If your SSAS server is part of a domain, this
registration is stored in Active Directory against the service account. It's required for some
authentication scenarios (particular client tools). If you change the service account for SSAS,
you must delete the original SPN and register a new one for the new account. You can do this
with the setspn.exe command-line tool.

Here’s further guidance from SQL Server Books Online:

“Service SIDs are available in SQL Server 2008 on Windows Server 2008 and
Windows Vista operating systems to allow service isolation. Service isolation
provides services a way to access specific objects without having to either run in
a high-privilege account or weaken the object’s security protection. A SQL Server
service can use this identity to restrict access to its resources by other services
or applications. Use of service SIDs also removes the requirement of managing
different domain accounts for various SQL Server services.
A service isolates an object for its exclusive use by securing the resource with an
access control entry that contains a service security ID (SID). This ID, referred to as
a per-service SID, is derived from the service name and is unique to that service.
After a SID has been assigned to a service, the service owner can modify the access
control list for an object to allow access to the SID. For example, a registry key in
HKEY_LOCAL_MACHINE\SOFTWARE would normally be accessible only to services
with administrative privileges. By adding the per-service SID to the key’s ACL, the
service can run in a lower-privilege account, but still have access to the key.”

Now that you’ve verified your SSAS installation and checked to make sure the service was
configured correctly and is currently running, it’s time to look at some of the tools you’ll use
to work with OLAP objects. For illustration, we’ve installed the AdventureWorks DW2008
sample OLAP project found on CodePlex, because we believe it’s more meaningful to explore
the various developer surfaces with information already in them. In the next chapter, we’ll
build a cube from start to finish. So if you’re already familiar with SSMS and SQL Server
Profiler, you might want to skip directly to that chapter.

SSAS in SSMS
Although we don’t believe that the primary audience of this book is administrators, we do
choose to begin our deep dive into the GUI tools with SSMS. SQL Server Management Studio
is an administrative tool for SQL Server relational databases, SSAS OLAP cubes and data min-
ing models, SSIS packages, SSRS reports, and SQL Server Compact edition data. The reason
we begin here is that we’ve found the line between SSAS developer and administrator to be
quite blurry. Because of a general lack of knowledge about SSAS, we’ve seen many an SSAS
developer being asked to perform administrative tasks for the OLAP cubes or data mining
structures that have been developed. Figure 6-5 shows the connection dialog box for SSMS.

Figure 6-5 SSMS is the unified administrative tool for all SQL Server 2008 components.

After you connect to SSAS in SSMS, you are presented with a tree-like view of all SSAS
objects. The top-level object is the server, and databases are next. Figure 6-6 shows this tree
view in Object Explorer. An OLAP database object is quite different than a relational database
object, which is kept in SQL Server’s RDBMS storage. Rather than having relational tables,
views, and stored procedures, an OLAP database consists of data sources, data source views,
cubes, dimensions, mining structures, roles, and assemblies. All of these core object types
are represented by folders in the Object Explorer tree view. These folders can contain child
objects as well, as shown in Figure 6-6 in the Measure Groups folder that appears under a
cube in the Cubes folder.

So what are all of these objects? Some should be familiar to you based on our previous dis-
cussions of OLAP concepts, including cubes, dimensions, and mining structures. These are
the basic storage units for SSAS data. You can think of them as somewhat analogous to rela-
tional tables and views in that respect, although structurally, OLAP objects are not relational
but multidimensional.

Figure 6-6 Object Explorer lists all SSAS objects in a tree-like structure.

Data sources represent connections to source data. We’ll be exploring them in more detail in
this chapter and the next one. Data source views are conceptually similar to relational views
in that they represent a view of the data from one or more defined data sources in the proj-
ect. Roles are security groups for SSAS objects. Assemblies are .NET types to be used in your
SSAS project—that is, they have been written in a .NET language and compiled as .dlls.

The next area to explore in SSMS is the menus. Figure 6-7 shows both the menu and stand-
ard toolbar. Note that the standard toolbar displays query types for all possible compo-
nents—that is, relational (Transact-SQL) components, multidimensional OLAP cubes (MDX),
data mining structures (DMX), administrative metadata for OLAP objects (XMLA), and SQL
Server Compact edition.

Figure 6-7 The SSMS standard toolbar displays query options for all possible SQL Server 2008 components.

It’s important that you remember the purpose of SSMS—administration. When you think
about this, the fact that it’s straightforward to view, query, and configure SSAS objects—but
more complex to create them—is understandable. You primarily use BIDS to create OLAP
objects. Because this is a GUI environment, you’re also provided with guidance should you
want to examine or query objects. Another consideration is that SSMS is not an end-user tool.
Even though the viewers are sophisticated, SSMS is designed for SSAS administrators.

How Do I View OLAP Objects?


SSMS includes many object viewers. You’ll see these same viewers built into other tools
designed to work with SSAS, such as BIDS. You’ll also find versions of these viewers built into
client tools, such as Microsoft Office Excel 2007. The simplest and fastest way to explore
cubes and mining models in SSMS is to locate the object in the tree view and then to right-
click on it. For cubes, dimensions, and mining structures, the first item on the shortcut menu
is Browse.

We’ll begin our exploration with the Product dimension. Figure 6-8 shows the results of
browsing the Product dimension. For each dimension, we have the ability to drill down to see
the member names at the defined levels—in this case, at the category, subcategory, and indi-
vidual item levels. In addition to being able to view the details of the particular dimensional
(rollup) hierarchy, we can also select a localization (language) and member properties that
might be associated with one or more levels of a dimension. In our example, we have elected
to include color and list price in our view for the AWC Logo Cap clothing item. These mem-
ber properties have been associated with the item (bottom) level of the product dimension.

Figure 6-8 The dimension browser enables you to view the data in a dimension.

The viewing options available for dimensions in SSMS include the ability to filter and to
perform dimension writeback. Writeback has to be enabled on the particular dimension, and the
connected user needs dimension writeback permission to use this action in SSMS.

In addition to being able to view the dimension information, you can also see some of the
metadata properties by clicking Properties on the shortcut menu. Be aware that you’re view-
ing a small subset of structural properties in SSMS. As you would expect, these properties are
related to administrative tasks associated with a dimension. Figure 6-9 shows the General page
of the Product dimension's Properties dialog box. Note that the only setting you can change
in this view is Processing Mode. We’ll examine the various processing modes for dimen-
sions and the implications of using particular selections in Chapter 9, “Processing Cubes and
Dimensions.”

Figure 6-9 The Dimension Properties dialog box in SSMS shows administrative properties associated with a dimension.

In fact, you can process OLAP dimensions, cubes, and mining structures in SSMS. You do this
by right-clicking on the object and then choosing Process on the shortcut menu. Because
this topic requires more explanation, we’ll cover it in Chapter 9. Suffice it to say at this point
that, from a high level, SSAS object processing is the process of copying data from source
locations into destination containers and performing various associated processing actions
on this data as part of the loading process. As you might expect, these processes can be
complex and require that you have an advanced understanding of the SSAS objects before
you try to implement the objects and tune them. For this reason, we’ll explore processing in
Part II of this book.

If you’re getting curious about the rest of the metadata associated with a dimension, you
can view this information in SSMS as well. This task is accomplished by clicking on the short-
cut menu option Script Dimension As, choosing Create To, and selecting New Query Editor
Window. The results are produced as pure XMLA script. You’ll recall from earlier in the book
that XMLA is a dialect of XML.

What you’re looking at is a portion of the XMLA script that is used to define the structure
of the dimension. Although you can use Notepad to create SSAS objects, because they are
entirely based on an XMLA script, you’ll be much more productive using the graphical user
interface in BIDS to generate this metadata script. The reason you can generate XMLA in
SSMS is that you need the XMLA script whenever you want to re-create OLAP objects: XMLA is
used to copy, move, and back up SSAS objects. In fact, you can execute the XMLA
query you’ve generated using SSMS. We take a closer look at querying later in this chapter.
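
To give you a feel for the output, a scripted dimension has roughly the following shape. This is
heavily abbreviated; the real script contains the full attribute, hierarchy, and source binding
definitions, and the IDs shown here are illustrative:

<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>Adventure Works DW 2008</DatabaseID>
  </ParentObject>
  <ObjectDefinition>
    <Dimension>
      <ID>Dim Product</ID>
      <Name>Product</Name>
      <!-- attribute, hierarchy, and data source binding elements appear here -->
    </Dimension>
  </ObjectDefinition>
</Create>

Executing the full version of such a script is how you copy, move, or re-create the object on
another server.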

Now that you’ve seen how to work with objects, we’ll simply repeat the pattern for OLAP
cubes and data mining structures. That is, we’ll first view the cube or structure using the
Browse option, review configurable administrative properties, and then take a look at the
XMLA that is generated. We won’t neglect querying either. After we examine browsing,
properties, and scripting for cubes and models, we’ll look at querying the objects using the
appropriate language—MDX, DMX, or XMLA.

How Do I View OLAP Cubes?


The OLAP cube browser built into SSMS is identical to the one you’ll be working with in BIDS
when you’re developing your cubes. It’s a sophisticated pivot table–style interface. The more
familiar you become with it, the more productive you’ll be. Just click Browse on the shortcut
menu after you’ve selected any cube in the Object Explorer in SSMS to get started. Doing this
presents you with the starter view. This view includes the use of hint text (such as Drop Totals
Of Detail Field Fields Here) in the center work area that helps you understand how best to
use this browser.

On the left side of the browser, you’re presented with another object browser. This is where
you select the items (or aspects) of the cube you want to view. You can select measures,
dimension attributes, levels, or hierarchies. Note that you can select a particular measure as a
filter from the drop-down list box at the top of this object browser. Not only will this filter the
measures selected, it will also filter the associated dimensions so that you’re selecting from an
appropriate subset as you build your view.

Measures can be viewed in the Totals work area. Dimension attributes, levels, or hierarchies
can be viewed on the Rows, Columns, or Filters (also referred to as slicers) axis. These axes
are labeled with the hint text Drop xxx Fields Here. We’ll look at Filters or Slicers axes in more
detail later in this chapter.

At the top of the browser, you can select a perspective. A perspective is a defined view of an
OLAP cube. You can also select a language. Directly below that is the Filter area, where you
can create a filter expression (which is actually an MDX expression) by dragging and dropping
a dimension level or hierarchy into that area and then completing the rest of the informa-
tion—that is, configuring the Hierarchy, Operator, and Filter Expression options. We’ll be dem-
onstrating this shortly. To get started, drag one or more measures and a couple of dimensions
to the Rows and Columns axes. We’ll do this and show you our results in Figure 6-10.

To set up our first view, we filtered our list by the Internet Sales measure group in the
object browser. Next we selected Internet Sales Amount and Internet Order Quantity as our
measures and dragged them to that area of the workspace. We then selected the Product
Categories hierarchy of the Product dimension and dragged it to the Rows axis. We also
selected the Sales Territory hierarchy from the Sales Territory dimension and dragged it to
the Columns axis.

We drilled down to show detail for the Accessories product category and Gloves subcategory
under the Clothing product category on the Rows axis. And finally, we filtered the Sales
Territory Group information to hide the Pacific region. The small blue triangle next to the
Group label indicates that a filter has been applied to this data. If you want to remove any
item from the work area, just click it and drag it back to the left side (list view). Your cursor
will change to an X, and the item will be removed from the view.

It’s much more difficult to write the steps as we just did than to actually do them! And that is
the point. OLAP cubes, when correctly designed, are quick, easy, and intuitive to query. What
you’re actually doing when you’re visually manipulating the pivot table surface is generat-
ing MDX queries. The beauty of this interface is that end users can do this as well. Gone are
the days when every new query request against a reporting system required developers to rewrite
(and tune) database queries.

Figure 6-10 Building an OLAP cube view in SSMS



Let’s add more sophistication to our view. To do this, we’ll use the filter and slicer capabilities
of the cube browser. We’ll also look at the pivot capability and use the built-in common que-
ries. To access the latter, you can simply right-click on a measure in the measures area of the
designer surface and select from a shortcut menu, which presents you with common queries,
such as Show Top 10 Values and other options as well. Figure 6-11 shows our results.

Figure 6-11 Results of building an OLAP cube view in SSMS

Here are the steps we took to get there.

First we dragged the Promotions hierarchy from the Promotion dimension to the slicer (Filter
Fields) area. We then set a filter by clearing the check boxes next to the Reseller promotion
dimension members. This resulted in showing data associated only with the remaining mem-
bers. Note that the label indicates this as well by displaying the text “Excluding: Reseller.”

We then dragged the Ship Date.Calendar Year hierarchy from the Ship Date dimension to the Filter area; we
set the Operator area to Equal, and in the Filter Expression area we chose the years 2003
and 2004 from the available options. Another area to explore is the nested toolbar inside of
the Browser subtab. Using buttons on this tab toolbar, you can connect as a different user
and sort, filter, and further manipulate the data shown in the working pivot table view. Note
that there is an option to show only the top or bottom values (1, 2, 5, 10, or 25 members or
a percentage). Finally, if drillthrough is enabled for this cube, you can drill through using this
browser by right-clicking on a data cell and selecting that option. Drillthrough allows you
to see additional columns of information that are associated with the particular fact item (or
measure) that you’ve selected. You should spend some time experimenting with all the tool-
bar buttons so that you’re thoroughly familiar with the different built-in query options. Be
aware that each time you select an option, you’re generating an MDX query to the underly-
ing OLAP cube.
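
Like most of what the browser does, drillthrough is conceptually just another statement sent to
the cube. A hand-written MDX equivalent looks something like the following sketch; the cell
coordinates (and the MAXROWS limit) are illustrative:

DRILLTHROUGH MAXROWS 100
SELECT ([Measures].[Internet Sales Amount]) ON COLUMNS
FROM [Adventure Works]
WHERE ([Product].[Product Categories].[Category].&[1])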

Note also that when you select cells in the grid, additional information is shown in a tooltip.
You can continue to manipulate this view for any sort of testing purposes. Possible actions
also include pivoting information from the Rows axis to the Columns axis, from the slicer to the
filter, and so on. Conceptually, you can think of this manipulation as somewhat similar to
working with a Rubik’s cube. Of course, OLAP cubes generally contain more than three
dimensions, so this analogy is just a starting point.

Viewing OLAP Cube Properties and Metadata


If you next want to view the administrative properties associated with the particular OLAP
cube that you’re working with (as you did for dimensions), you simply right-click that cube in
the SSMS Object Browser and then click Properties. Similar to what you saw when you per-
formed this type of action on an OLAP dimension, you’ll then see a dialog box similar to the
one shown in Figure 6-12 that allows you to view some properties. The only properties you
can change in this view are those specifically associated with cube processing. As mentioned
previously, we’ll look at cube processing options in more detail in Chapter 9.

Figure 6-12 OLAP cube properties in SSMS



By now, you can probably guess how you’d generate an XMLA metadata script for an OLAP
cube in SSMS. Just right-click the cube in the object browser and click Script Cube As on the
shortcut menu, choose Create To, and select New Query Editor Window. Note also that you
can generate XMLA scripts from inside any object property window. You do this by clicking
the Script button shown at the top of Figure 6-12.

Now that we’ve looked at both OLAP dimensions and cubes in SSMS, it’s time to look at a
different type of object—SSAS data mining structures. Although conceptually different, data
mining (DM) objects are accessed using methods identical to those we’ve already seen—that
is, browse, properties, and script.

How Do I View DM Structures?


As we begin our tour of SSAS data mining structures, we need to remember a couple of
concepts that were introduced earlier in this book. Data mining structures are containers
for one or more data mining models. Each data mining model uses a particular data min-
ing algorithm. Each data mining algorithm has one or more data mining algorithm viewers
associated with it. Also, each data mining model can be evaluated with a lift chart as well.
New to SQL Server 2008 is the ability to perform cross validation. Because many of
these viewing options require more explanation about data mining structures, at this point
we’re going to stick to the rhythm we’ve established in this chapter—that is, we’ll look at a
simple view, followed by the object properties, and then the XMLA. Because the viewers are
more complex for data mining objects than for OLAP objects, we’ll spend a bit more time
exploring.

We’ll start by browsing the Customer Mining data mining structure. Figure 6-13 shows the
result. What you’re looking at is a rendering of the Customer Clusters data mining model,
which is part of the listed structure. You need to select the Cluster Profiles tab to see the
same view. Note that you can make many adjustments to this browser, such as legend, num-
ber of histogram bars, and so on. At this point, some of the viewers won’t make much sense
to you unless you have a background using data mining. Some viewers are more intuitive
than others. We’ll focus on showing those in this section.

It’s also important for you to remember that although these viewers are quite sophisticated,
SSMS is not an end-user client tool. We find ourselves using the viewers in SSMS to demon-
strate proof-of-concept ideas in data mining to business decision makers (BDMs), however. If
these viewers look familiar to you, you’ve retained some important information that we pre-
sented in Chapter 2, “Visualizing Business Intelligence Results.” These viewers are nearly iden-
tical to the ones that are intended for end users as part of the SQL Server 2008 Data Mining
Add-ins for Office 2007. When you install the free add-ins, these data mining viewers become
available as part of the Data Mining tab on the Excel 2007 Ribbon. Another consideration for
you is this—similar to the OLAP cube pivot table viewer control in SSMS that we just finished
looking at, these data mining controls are also part of BIDS.

Figure 6-13 Data mining structure viewer in SSMS

In our next view, shown in Figure 6-14, we’ve selected the second mining model, Subcategory
Associations, associated with the selected mining structure. Because this second model has
been built using a different mining algorithm, after we make this selection the Viewer drop-
down list automatically updates to list the associated viewers available for that particular
algorithm. We then chose the Dependency Network tab from the three available views and
did a bit of tuning of the view, using the embedded toolbar to produce the view shown (for
example, sized it to fit, zoomed it, and so on).

An interesting tool that is part of this viewer is the slider control on the left side. This control
allows you to dynamically adjust the strength of association shown in the view. We’ve found
that this particular viewer is quite intuitive, and it has helped us to explain the power of data
mining algorithms to many nontechnical users.

As you did with the OLAP pivot table viewer, you should experiment with the included data
mining structure viewers. If you feel a bit frustrated because some visualizations are not yet
meaningful to you, we ask that you have patience. We devote Chapter 12, “Understanding
Data Mining Structures,” to a detailed explanation of the included data mining algorithms. In
that chapter, we’ll provide a more detailed explanation of most included DM views.

Figure 6-14 Data mining structure viewer in SSMS showing the Dependency Network view for the Microsoft Association algorithm

Tip You can change any of the default color schemes for the data mining viewers in SSMS by
adjusting the colors via Tools, Options, Designers, Analysis Services Designers, Data Mining
Viewers.

Because the processes for viewing the data mining object administrative properties and for
generating an XMLA script of the object’s metadata are identical to those used for OLAP
objects, we won’t spend any more time reviewing them here.

How Do You Query SSAS Objects?


As with relational data, you have the ability to write and execute queries against multidimen-
sional data in SSMS. This is, however, where the similarity ends. The reason is that when you
work in an RDBMS, you need to write any query to the database using SQL. Even if you gen-
erate queries using tools, you’ll usually choose to perform manual tuning of those queries.
Tuning steps can include rewriting the SQL, altering the indexing on the involved tables, or
both.

SSAS objects can be, and sometimes are, queried manually. However, the extent to which you'll
choose to write manual queries will be considerably less than the extent to which you’ll query
relational sources. What are the reasons for this? There are several:

■■ MDX and DMX language expertise is rare among the developer community. With
less experienced developers, the time to write and optimize queries manually can be
prohibitive.
■■ OLAP cube data is often delivered to end users via pivot table–type interfaces (that is,
Excel, or some custom client that uses a pivot table control). These interfaces include
the ability to generate MDX queries by dragging and dropping members of the cube
on the designer surface—in other words, by visual query generation.
■■ SSMS and BIDS have many interfaces that also support the idea of visual query genera-
tion for both MDX and DMX. This feature is quite important to developer productivity.

What we’re saying here is that although you can create manual queries, and SSMS is the
place to do this, you’ll need to do this significantly less frequently while working with SSAS
objects (compared to what you have been used to with RDBMS systems). It’s very important
for you to understand and embrace this difference. Visual development does not mean lack
of sophistication or power in the world of SSAS.

As you move toward understanding MDX and DMX, we suggest that you first monitor the
queries that SSMS generates via the graphical user interface. SQL Server Profiler is an excel-
lent tool to use when doing this.

What Is SQL Server Profiler?


SQL Server Profiler is an activity capture tool for the database engine and SSAS that ships
with SQL Server 2008. SQL Server Profiler broadly serves two purposes. The first is to moni-
tor activity for auditing or security purposes. To that end, SQL Server Profiler can be easily
configured to capture login attempts, access to specific objects, and so on. The other main use
of the tool is to monitor activity for performance analysis. SQL Server Profiler is a powerful
tool—when used properly, it’s one of the keys to understanding SSAS activity. We caution
you, however, that SQL Server Profiler can cause significant overhead on production servers.
When you’re using it, you should run it on a development server or capture only essential
information.

SQL Server Profiler captures are called traces. Appropriately capturing only events (and asso-
ciated data) that you’re interested in takes a bit of practice. There are many items you can
capture! The great news is that after you’ve determined the important events for your par-
ticular business scenario, you can save your defined capture for reuse as a trace template.

If you’re familiar with SQL Server Profiler from using it to monitor RDBMS data, you’ll note
that when you set the connection to SSAS for a new trace, SQL Server Profiler presents you
with a set of events that is specific to SSAS to select from. See the SQL Server Books Online
topics “Introduction to Monitoring Analysis Services with SQL Server Profiler” and “Analysis
Services Event Classes” for more detailed information. Figure 6-15 shows some of the events
that you can choose to capture for SSAS objects. Note that in this view, we’ve selected Show
All Events in the dialog box. This switch is off by default.

After you’ve selected which events (and what associated data) you want to capture, you can
run your trace live, or you can save the results either to a file or to a relational table for you
to rerun and analyze later. The latter option is helpful if you want to capture the event on a
production server and then replay the trace on a development server for analysis and test-
ing of queries.

At this point, we’re really just going to use SQL Server Profiler to view MDX queries that are
generated when you manipulate the dimension and cube browsers in SSMS. The reason
we’re doing this is to introduce you to the MDX query language. You can also use SQL Server
Profiler to capture generated DMX queries for data mining structures that you manipulate
using the included browsers in SSMS.

Figure 6-15 SQL Server Profiler allows you to capture SSAS-specific events for OLAP cubes and data mining structures.

To see how query capture works, just start a trace in SQL Server Profiler, using all of the
default capture settings, by clicking Run on the bottom right of the Trace Properties dialog
box. With the trace running, switch to SSMS, right-click on the Adventure Works sample cube
in Object Explorer, click Browse, and then drag a measure to the pivot table design area.

We dragged the Internet Sales Amount measure for our demo. After you’ve done that, switch
back to SQL Server Profiler and then click on the pause trace button on the toolbar. Scroll
through the trace to the end, where you should see a line with the EventClass showing Query
End and EventSubclass showing 0 - MDXQuery. Then click that line in the trace. Your results
should look similar to Figure 6-16.

Note that you can see the MDX query that was generated by your drag action on the pivot
table design interface in SSMS. This query probably doesn’t seem very daunting to you, par-
ticularly if you’ve worked with Transact-SQL before. Don’t be fooled, however; this is just the
tip of the iceberg.

Figure 6-16 SQL Server Profiler allows you to view MDX query text details.
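
The exact text varies from client to client, but the statement captured for this single-measure
drag is along these lines (a simplified sketch):

SELECT NON EMPTY { [Measures].[Internet Sales Amount] } ON COLUMNS
FROM [Adventure Works]
CELL PROPERTIES VALUE, FORMAT_STRING, FORMATTED_VALUE

Even this trivial query carries an explicit CELL PROPERTIES clause, which is typical of
tool-generated MDX.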

Now let’s get a bit more complicated. Click the Play button in SQL Server Profiler to start the
trace again. After that, return to the SSMS OLAP cube pivot table browse area and then drag
and drop some dimension information (hierarchies or members) to the rows, columns, slicer,
and filter areas. After you have completed this, return to SQL Server Profiler and again pause
your trace and then examine the MDX query that has been generated. Your results might
look similar to what we show in Figure 6-17. You can see if you scroll through the trace that
each action you performed by dragging and dropping generated at least one MDX query.

Figure 6-17 Detail of a complex MDX query

We find SQL Server Profiler to be an invaluable tool in helping us to understand exactly
what type of MDX query is being generated by the various tools (whether developer,
administrator, or end user) that we use. Also, SQL Server Profiler does support tracing data
mining activity. To test this, you can use the SSMS Object Browser to browse any data mining
model while a SQL Server Profiler trace is active. In the case of data mining, however, you’re
not presented with the DMX query syntax. Rather, what you see in SQL Server Profiler is the
text of the call to a data mining stored procedure. So the results in SQL Server Profiler look
something like this:

CALL System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.GetStatistics('Subcategory Associations')

These results are also strangely categorized as 0 - MDXQuery type queries in the EventSub-
class column of the trace. You can also capture data mining queries using SQL Server Profiler.
These queries are represented by the EventSubclass type 1 – DMXQuery in SQL Server Profiler.

We’ll return to SQL Server Profiler later in this book, when we discuss auditing and
compliance. Also, we’ll take another look at this tool in Chapters 10 and 11, which we devote
to sharing more information about manual query and expression writing using the MDX
language. Speaking of queries, before we leave our tour of SSMS, we’ll review the methods
you can use to generate and execute manual queries in this environment.

Using SSAS Query Templates


Another powerful capability included in SSMS is the ability to write and execute queries
against SSAS objects. These queries can be written in three languages: MDX, DMX, and XMLA.
At this point, we’re not yet ready to do a deep dive into the syntax of any of these three
languages; that will come later in this book. Rather, here we’d like to understand the query
execution process. To that end, we’ll work with the included query templates for these three
languages. To do this, we need to choose Template Explorer from the View menu, and then
click the Analysis Services (cube) icon to show the three folders with templated MDX, DMX,
and XMLA queries. The Template Explorer is shown in Figure 6-18.

Figure 6-18 SSMS includes MDX, DMX, and XMLA query templates

You can see that the queries are further categorized into functionality type in child folders
under the various languages—such as Model Content and Model Management under DMX.
You can also create your own folders and templates in the Template Explorer by right-clicking
and then clicking New. After you do this, you’re actually saving the information to this loca-
tion on disk: C:\Users\Administrator\AppData\Roaming\Microsoft\Microsoft SQL Server\100\
Tools\Shell\Templates\AnalysisServices.

Using MDX Templates


Now that you've opened the templates, you'll see that the MDX templates fall into two types:
expressions and queries. Expressions use the syntax With Member and create a calcu-
lated member as part of a sample query. You can think of a calculated member as somewhat
analogous to a calculated cell or set of cells in an Excel workbook, with the difference being
that calculated members are created in n-dimensional OLAP space. We’ll talk in greater depth
about when, why, and how you choose to use calculated members in Chapter 9.
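
For example, a query-scoped calculated member created with this syntax looks something like the
following sketch (the measure names come from the AdventureWorks sample, and the new member
exists only for the duration of the query):

WITH MEMBER [Measures].[Average Internet Sale] AS
    [Measures].[Internet Sales Amount] / [Measures].[Internet Order Quantity],
    FORMAT_STRING = 'Currency'
SELECT { [Measures].[Internet Sales Amount],
         [Measures].[Average Internet Sale] } ON COLUMNS
FROM [Adventure Works]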

Queries retrieve some subset of an OLAP cube as an ADO.MD CellSet result, and they do not
contain calculated members. To execute a basic MDX query, simply double-click the Basic
Query template in the Template Explorer and then connect to SSAS. You can optionally write
queries in a disconnected state and then, when ready, connect and execute the query. This
option is available to reduce resource consumption on production servers.

You need to fill the query parameters with actual cube values before you execute the query.
Notice that the query window opens yet another metadata explorer in addition to the default
Object Explorer. You’ll probably want to close Object Explorer when executing SSAS queries
in SSMS. Figure 6-19 shows the initial cluttered, cramped screen that results if you leave all
the windows open. It also shows the MDX parser error that results if you execute a query
with errors. (See the bottom window, in the center of the screen, with text underlined with a
squiggly line.)

Now we’ll make this a bit more usable by hiding the Object Explorer and Template Explorer
views. A subtle point to note is that the SSAS query metadata browser includes two filters: a
Cube filter and, below it, a Measure Group filter. The reason for this is that SSAS OLAP cubes
can contain hundreds or even thousands of measure groups.

Figure 6-20 shows a cleaned-up interface. We’ve left the Cube filter set at the default,
Adventure Works, but we’ve set the Measure Group filter to Internet Sales. This reduces the
number of items in the viewer, as it shows only items that have a relationship to measures
associated with the selected measure group. Also note that in addition to a list of metadata,
this browser includes a second nested tab called Functions. As you’d expect, this tab contains
an MDX function language reference list.

Figure 6-19 The SSMS SSAS query screen can be quite cluttered by default.

You might be wondering why you’re being presented with yet another metadata interface,
particularly because you’re inside of a query-writing tool. Aren’t you supposed to be writ-
ing the code manually here? Nope, not yet. Here's the reason why: MDX object naming is not as
straightforward as it looks. For example, depending on the uniqueness of member names in a
dimension, you sometimes need to reference a member by its key rather than by its name.
Sound complex? It is. Dragging and drop-
ping metadata onto the query surface can make you more productive if you’re working with
manual queries.

To run the basic query, you need to replace the items shown in the sample query between
angle brackets—that is, <some value>—with actual cube metadata. Another way to under-
stand this is to select Specify Values For Template Parameters on the Query menu. You can
either type the information into the Template Parameters dialog box that appears, or you can
click on any of the metadata from the tree view in the left pane and then drag it and drop it
onto the designer surface template areas.

Figure 6-20 The SSMS SSAS query screen with fewer items in the viewer

We’ll use the latter approach to build our first query. We’ll start by dragging the cube name
to the From clause. Next we’ll drag the Customers.Customer Geography hierarchy from the
Customer dimension to the On Columns clause. We’ll finish by dragging the Date.Calendar
Year member from the Date hierarchy and Calendar hierarchy to the On Rows clause. We’ll
ignore the Where clause for now. As with Transact-SQL queries, if you want to execute only
a portion of a query, just select the portion of interest and press F5. The results are shown in
Figure 6-21.

Figure 6-21 SSMS SSAS query using simple query syntax
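
Dragging the metadata produces a statement along these lines (a sketch; the object names come
from the AdventureWorks sample):

SELECT [Customer].[Customer Geography] ON COLUMNS,
       [Date].[Calendar Year] ON ROWS
FROM [Adventure Works]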



Do the results seem curious to you? Are you wondering which measure is being shown? Are
you wondering why only the top-level member of each of the selected hierarchies is shown
on columns and rows? As we’ve said, MDX is a deceptively simple language. If you’ve worked
with Transact-SQL, which bears some structural relationship but is not very closely related
at all, you’ll find yourself confounded by MDX. We do plan to provide you with a thorough
grounding in MDX. However, we won’t be doing so until much later in this book—we’ll use
Chapters 10 and 11 to unravel the mysteries of this multidimensional query language.

At this point in our journey, it’s our goal to give you an understanding of how to view and
run prewritten MDX queries. Remember that you can also re-execute, in the SSMS SSAS query
environment, any queries that you've captured via SQL Server Profiler traces.

Because we know that you’re probably interested in just a bit more about MDX, we’ll add a
couple of items to our basic query. Notably, we’ll include the MDX Members function so that
we can display more than the default member of a particular hierarchy on an axis. We’ll also
implement the Where clause so that you can see the result of filtering. The results are shown
in Figure 6-22.

We changed the dimension member information on Columns to a specific level (Country),
and then we filtered in the Where clause to the United States only. The second part of the
Where clause is an example of the cryptic nature of MDX. The segment [Product].[Product
Categories].[Category].&[1] refers to the category named Bikes. We used the drag-and-drop
metadata method to determine when to use names and when to use ordinals in the query.
This is a time-saving technique you'll want to use as well.

Figure 6-22 MDX query showing filtering via the Where clause
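
Typed out, a query along these lines looks roughly like the following. This is our reconstruction rather than a copy of the figure; note that a hierarchy that appears on an axis can't also appear in the Where clause, so here we place the single United States member directly on the column axis and slice only by product category:

SELECT
    [Customer].[Customer Geography].[Country].&[United States] ON COLUMNS,
    [Date].[Calendar Year].Members ON ROWS
FROM [Adventure Works]
WHERE ([Product].[Product Categories].[Category].&[1])  -- the Bikes category

To show every country on the column axis instead, replace the single member with [Customer].[Customer Geography].[Country].Members.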

Using DMX Templates


Next we’ll move to the world of DM query syntax. Again, we’ll start by taking a look at the
included templates in the Template Explorer. They fall into four categories: Model Content,
Model Management, Prediction Queries, and Structure Content.

When you double-click on a DMX query template, you’ll see that the information in the
Metadata browser reflects a particular mining model. You can select different mining model
metadata in the pick list at the top left of the browser. Also, the functions shown now include
those specific to data mining. The Function browser includes folders for each data mining
algorithm, with associated functions in the appropriate folder. Because understanding how
to query data mining models requires a more complete understanding of the included algo-
rithms, we’ll simply focus on the mechanics of DMX query execution in SSMS at this point.

To do this, we’ll double-click the Model Attributes sample DMX query in the Model Content
folder that you access under DMX in the Template Explorer. Then we’ll work with the tem-
plated query in the workspace. As with templated MDX queries, the DMX templates indicate
parameters with the <value to replace> syntax. You can also click the Query menu and select
Specify Values For Template Parameters as you can with MDX templates. We’ll just drag the
[Customer Clusters] mining model to the template replacement area. Note that you must
include both the square brackets and the single quotes, as shown in Figure 6-23, for the
query to execute successfully.

Figure 6-23 A DMX query showing mining model attributes
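
If you prefer to type DMX directly, a simple content query against the same model returns comparable model metadata. This is a generic content query (not necessarily the exact statement that the Model Attributes template generates):

-- Return one row per node (cluster) in the mining model's learned content
SELECT NODE_CAPTION, NODE_TYPE, NODE_SUPPORT
FROM [Customer Clusters].CONTENT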



If you click on the Messages tab in the results area (at the bottom of the screen), you’ll see
that some DMX queries return an object of type Microsoft.AnalysisServices.AdomdClient.
AdomdDataReader. Other DMX query types, such as DMX prediction queries, return scalar
values.

For more information, see the SQL Server Books Online topic “Data Mining Extensions (DMX)
Reference.”

Using XMLA Templates


As with the previous two types of templates, SSMS is designed to be an XMLA query view-
ing and execution environment. The SSMS Template Explorer includes three categories of
XMLA sample queries: Management, Schema Rowsets, and Server Status.
The XMLA language is an XML dialect, so structurally it looks like XML rather than a data
structure query language, such as MDX or DMX (which look rather Transact-SQL-like at first
glance). One important difference between MDX and XMLA is that XMLA is case-sensitive
and space-sensitive, following the rules of XML in general.

Another important difference is that the Metadata and Function browsers are not available
when you perform an XMLA query. Also, the results returned are in an XML format. In Figure
6-24, we show the results of executing the default Connections template. This shows detailed
information about who is currently connected to your SSAS instance.
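
Under the covers, the Connections template is essentially a Discover request for the connections schema rowset. Stripped of its optional restriction and property lists, the request has roughly this shape:

<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
  <RequestType>DISCOVER_CONNECTIONS</RequestType>
  <Restrictions />
  <Properties />
</Discover>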

Be reminded that metadata for all SSAS objects—that is, OLAP dimensions, cubes, data min-
ing models, and so on—can easily be generated in SSMS by simply right-clicking the object
in the Object Browser and then clicking Script As. This is a great way to begin to understand
the capabilities of XMLA. In production environments, you’ll choose to automate many
administrative tasks using XMLA scripting.
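
For example, a minimal XMLA command to fully process an SSAS database might look like the following sketch. The DatabaseID value here is a placeholder; use the ID that appears in the script SSMS generates for your own database:

<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>Adventure Works DW 2008</DatabaseID>
  </Object>
  <Type>ProcessFull</Type>
</Process>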

The templates in SSMS represent a very small subset of the XMLA commands that are avail-
able in SSAS. For a more complete reference, see the SQL Server Books Online topic “Using
XMLA for Analysis in Analysis Services (XMLA).” Another technical note: certain commands
used in XMLA are associated with a superset of commands in the Analysis Services Scripting
Language (ASSL). The MSDN documentation points out that ASSL commands include both
data definition language (DDL) commands, which define and describe instances of SSAS and
the particular SSAS database, and also XMLA action commands such as Create, which are
then sent to the particular object named by the ASSL. ASSL information is also referred to as
binding information in SQL Server Books Online.

Figure 6-24 SSAS XMLA connections query in SSMS

Closing Thoughts on SSMS


Although our primary audience is developers, as discussed, we’ve found that many SSAS
developers are also tasked with performing SSAS administrative tasks. For this reason, we
spent an entire chapter exploring the SSMS SSAS interface. Also, we find that using SSMS to
explore built objects is a gentle way to introduce OLAP and DM concepts to many interested
people. We’ve used SSMS to demonstrate these concepts to audiences ranging from .NET
developers to business analysts. Finally, we’d like to note that we’re continually amazed at the
richness of the interface. Even after having spent many years with SSAS, we still frequently
find little time-savers in SSMS.

Summary
In this chapter, we took a close look at the installation of SSAS. We then discussed some
tools you’ll be using to work with SSAS objects. We took a particularly detailed look at SSMS
because we’ve found ourselves using it time and time again on BI projects where we were
tasked with being developers. Our real-world experience has been that SSAS developers
must often also perform administrative tasks. So knowledge of SSMS can be a real time-
saver. We also included an introduction to SQL Server Profiler because we've found that many
clients, lacking an understanding of this powerful tool, either use it incorrectly or not at all.

By now, we’re sure you’re more than ready to get started developing your first OLAP cube
using BIDS. That’s exactly what we’ll be doing starting in the next chapter and then continu-
ing on through several additional chapters—being sure to hit all the important nooks and
crannies along the way.
Chapter 7
Designing OLAP Cubes Using BIDS
In Chapters 1 through 6, we set the stage for you to begin developing Microsoft SQL Server
Analysis Services (SSAS) OLAP cubes using Business Intelligence Development Studio (BIDS).
In those chapters, we defined business intelligence (BI) and introduced you to some of its
terminology, concepts, languages, and process and modeling methodologies. We then
looked at other tools you’ll be using, such as SQL Server Management Studio and Microsoft
SQL Server Profiler. Now, you’re ready to roll up your sleeves and get to work in BIDS. In
this chapter, we’ll tour the BIDS interface for Analysis Services OLAP cubes by looking at the
Adventure Works sample cube in the Adventure Works DW 2008 OLAP database. We'll also
build a simple cube from scratch so that you can see that process from end to end.

Using BIDS
BIDS is the primary tool in which you’ll be working as you develop your BI project. The good
news is that many of the tool concepts that you’ve learned from working in SQL Server
Management Studio (SSMS) are duplicated in BIDS, so you'll have a head start. In this section,
we'll first focus on how to use BIDS; then, in this chapter and over the next several chapters,
we'll turn to the specific tasks you'll perform in it as we dig deep into building OLAP cubes
and data mining structures.

As mentioned previously, SSAS installs BIDS in one of two ways. Either it installs BIDS as a
set of templates into an existing Visual Studio 2008 SP1 installation when you install SSAS,
or BIDS installs as a stand-alone tool called the Visual Studio Shell if no installation of Visual
Studio is present when you install SSAS. In either case, you start with solutions and projects,
so we’ll begin there.

When you’re starting with a blank slate, you’ll open BIDS and then create a new project. For
SSAS objects—which include both OLAP cubes and data mining (DM) structures—you’ll use
the SSAS template named Analysis Services Project, which is shown in the New Project dialog
box in Figure 7-1. You can see in this figure that the dialog box also includes templates for
SQL Server Integration Services (SSIS) and SQL Server Reporting Services (SSRS). Those will be
covered in the sections pertaining to the creation of those types of objects—that is, packages
and reports.


Figure 7-1 BIDS installs as a series of templates in an existing Visual Studio 2008 instance.

After you click OK, you’ll see the full BIDS development environment. If you have experience
using Visual Studio, you’ll find that many views, windows, and tools in BIDS are familiar to you
from having worked with similar structures in Visual Studio. If the Visual Studio development
environment is new to you, you’ll have to allow yourself time to become accustomed to the
development interface.

Let’s look first at the Solution Explorer view, which appears by default on the right side
of your work area. It contains a series of containers, or folders, to group the various SSAS
objects. If you click the Show All Files icon on the top toolbar in Solution Explorer, you’ll see
that exactly one file has been generated. In our case, it uses the default name of Analysis
Services Project1.database. If you right-click that file name and then click View Code, you'll
see the XMLA metadata. XMLA metadata is created as a result of actions or selections that
you make while developing SSAS objects in the BIDS environment. This pattern is one that
you’ll see repeated as you continue to work in this environment. Although you can hand-edit
the XMLA produced in BIDS, you won’t often do so. Rather, you’ll tend to use the generated
XMLA in native form when performing administrative tasks—for example, backing up, restor-
ing, copying, and moving SSAS objects—after you’ve completed your design work. You might
remember from previous chapters that you can easily generate XMLA from SSMS as well.

We’re now almost ready to start creating SSAS objects, but first let’s look at one more consid-
eration. SSAS development has two modes: offline and online. You need to understand the
implications of using either of these modes when doing development work using BIDS.

Offline and Online Modes


When you create a new SSAS project, you’re working in an offline, or disconnected, mode. As
previously mentioned, when you begin to create SSAS objects using the GUI designers, tem-
plates, and wizards in BIDS, you’ll be creating XMLA metadata, MDX queries or expressions,
and DMX statements. These items must be built and then deployed to a live SSAS server
instance for the objects to actually be created on the SSAS server (and for them to be avail-
able to be populated with source data). The steps for doing this are building and deploying.
These two steps can be completed using BIDS. You can also use other methods, such as SSMS
or script, to deploy, but not to perform a build.

After SSAS objects have been built and deployed, you then have a choice about the sub-
sequent development method. You can work live (connected to the SSAS service), or you
can work in a disconnected fashion. Both methods have advantages and disadvantages. In
the case of live development, of course, there’s no lag or latency when implementing your
changes. However, you’re working with a live server; if you choose to work with a production
server, you could end up making changes that adversely affect performance for other users.
In the worst case, you could make breaking changes to objects that others expect to access.
For these reasons, we recommend using the live development approach only when using
dedicated development servers in a single developer environment.

To connect to an existing, live SSAS solution using BIDS, choose Open from the File menu
and then select Analysis Services Database. You’ll then be presented with the dialog box
shown in Figure 7-2. There you’ll select the SSAS instance, SSAS database name, and whether
or not you’d like to add the current solution. We’ll open the Adventure Works DW 2008 sam-
ple. As mentioned previously, this sample is downloadable from CodePlex. Again, we remind
you, if you choose to work while connected, all changes you make are immediately applied
to the live (deployed) instance.

Figure 7-2 Connecting to a live SSAS database in BIDS

Your other option is to work in an offline mode. If you choose this option and have more
than one SSAS developer making changes, you must be sure to select and implement a
source control solution, such as Visual Studio Team System or something similar. The reason
for this is that the change conflict resolution process in BIDS is primitive at best. What we
mean is this: If multiple developers attempt to process changes, only the last change will
win and all interim changes will be overwritten. This behavior is often not desirable. If more
than one developer is on your team, you’ll generally work in offline mode. Teams working in
offline mode must understand this limitation.

BIDS gives you only minimal feedback to show whether you’re working live or not. If
you’re working live, the name of the server to which you’re connected (in our case, WIN-
VVP8K0GA45C) is shown in the title bar, as you can see in Figure 7-3.

Figure 7-3 The server name is shown in BIDS if you're working in connected mode.

As we show you how to create BIDS objects, we’ll use two instances of BIDS running on the
same computer. This is a technique that is best suited only to learning. We’ll first look at the
sample SSAS objects by type using the live, connected BIDS instance and the Adventure
Works DW 2008 sample. At each step, we’ll contrast that with the process used to create
these objects using the default, blank SSAS template and disconnected BIDS instance. We’ll
follow this process over the next few chapters as we drill down into the mechanics of creating
these objects in BIDS.

Working in Solution Explorer


The starting point for most of your development work will be to create or update SSAS
objects listed in Solution Explorer. The simplest way to work with this interface is to select the
object or container (folder) of interest and then click on the relevant item from the shortcut
menu. All the options are also available on the main menu of BIDS, but we find it fast to work
directly in Solution Explorer.

One other BIDS usability tip: keep in mind that object properties appear in two locations.
They're shown in the Properties window at the bottom right, below Solution Explorer, as is
traditional in Visual Studio, and sometimes you'll find additional property sheets after you click
on the Properties item on an object’s shortcut menu. Also, the properties available change
depending on whether you’re working in live or disconnected mode. For example, if you
right-click on the top node of the disconnected BIDS instance and select Properties on the
resulting menu, you see the property sheet shown in Figure 7-4. It allows you to configure
various options related to building, debugging, and deploying the SSAS solution.

However, if you attempt that same action in the connected instance, no pop-up dialog box is
shown. Rather, an empty Properties window is displayed in the bottom right of the develop-
ment environment. If you’re working live, and there’s no need to configure build, debug, or
deploy settings, this behavior makes sense. However, the surface inconsistency in the inter-
face can be distracting for new SSAS developers.

Figure 7-4 SSAS top-level properties for a disconnected instance

In addition, some shortcut-menu options occur in unexpected locations. An example of this is
the Edit Database option, which is available after you right-click the top node in Solution Explorer.
This brings up the General dialog box, which has configurable properties. On the Warnings
tab, shown in Figure 7-5, you can enable or disable design best practice warnings that have
been added to SQL Server 2008 SSAS. The reason these warnings were added is that many
customers failed to follow best practices in OLAP and data mining modeling and design
when using SSAS 2005. This resulted in BI objects that performed poorly under production
load. We’ll be looking more closely at these warnings as we proceed with object building.

Figure 7-5 SSAS database properties for a disconnected instance

At this point, we’ll just point out where you can review the master list and enable or turn off
warnings. These are only recommendations. Even if you leave all the warnings enabled, you’ll
still be able to build SSAS objects with any kind of structure using BIDS. None of the warn-
ings prevent an object from building. Object designs that violate any of the rules shown in
the warnings result in a blue squiggly line under the offending code and a compiler warning
at build time. Illegal design errors produce red squiggly line warnings, and you must first cor-
rect those errors before a successful build can occur.

We’ll look at a subset of SSAS objects first. These objects are common to all BI solutions.
That is, you’ll create and use data sources, data source views, roles, and (optionally) assem-
blies when you build either cubes (which contain dimensions) or mining structures. You can,
of course, build both cubes and mining structures in the same solution, although this is less
common in production situations. In that case, we typically create separate solutions for
cubes and for mining structures. Also, this first group of objects is far simpler to understand
than cubes or mining structures.

Data Sources in Analysis Services


A BI SSAS data source is simply a connection to some underlying data source. Be sure to keep
in mind that SSAS can use any type of data source that it can connect to. Many of our cus-
tomers have had the misperception that SSAS could use only SQL Server data as source data.
This is not true! As mentioned earlier, SSAS is now the top OLAP server for RDBMS systems
other than SQL Server, notably Oracle. The reason for this is that businesses find total cost of
ownership (TCO) advantages in Microsoft’s BI offering compared to BI products from other
vendors.

We’ll start by examining the data source that is part of the Adventure Works DW 2008
sample. First, double-click the Adventure Works DW data source in Solution Explorer. You’ll
then see the Data Source Designer dialog box with editable configuration information. The
General tab contains the Data Source Name, Provider, Connection String, Isolation Level,
Query Timeout, Maximum Number Of Connections, and (optional) Description sections. The
Impersonation Information tab, shown in Figure 7-6, contains the options for connection
credentials.

Figure 7-6 Data source configuration dialog box for connection impersonation settings

In Chapter 6, “Understanding SSAS in SSMS and SQL Server Profiler,” we talked about the
importance of using an appropriately permissioned service account for SSAS. Reviewing
the Data Source Designer dialog box, you might be reminded of why this configura-
tion is so important. Also, you might want to review the SQL Server Books Online topic,
“Impersonation Information Dialog Box,” to understand exactly how credentials are passed by
SSAS to other tiers.

Now we’ll switch to our empty (disconnected) BIDS instance and create a new data source
object. The quickest way to do so is for you to right-click the data source folder in Solution
Explorer and then click New Data Source. You’ll then be presented with a wizard that will
guide you through the steps of setting up a data source.

At this point, you might be surprised that you’re being provided with a wizard to complete a
task that is relatively simple. There are two points to remember here. First, BIDS was designed
for both developers and for nondevelopers (that is, administrators and business analysts) to
be able to quickly and easily create SSAS objects. Second, you’ll reach a point where you’ll be
happy to see a wizard because some SSAS objects are quite complex. We think it’s valuable
to simply click through the wizard settings so that you can see the process for creating
any data source.

If you want to examine or change any configuration settings, you can simply double-click
the new data source you created. Also, you can view the XMLA metadata that the wizard has
generated by right-clicking on the new data source in Solution Explorer and then clicking
View Code.

Before we leave data sources, we’ll cover just a couple more points. First, it’s common to have
multiple data sources associated with a single SSAS project. These sources can originate from
any type or location that SSAS is permitted to connect to. Figure 7-7 shows a list of included
providers.

Tip If you’re connecting to SQL Server data, be sure to use the default provider, Native OLE DB\
SQL Server Native Client 10.0. Using this provider will result in the best performance.

Figure 7-7 List of available providers for data source connections



Second, it’s important that you configure data sources with least-privileged accounts. Too
many times in the real world, we’ve seen poor practices, such as an administrator-level
account being used for this. Failing to use least-privileged connection accounts for data
source objects presents an unacceptable security vulnerability.

If you choose to use a specific Microsoft Windows user name and password, be aware that,
by default, BIDS does not save passwords in the connection string. So you’ll be prompted
for a password when you attempt to connect if you’re using this type of connection
authentication.

If you choose to use the credentials of the current user, be aware that BIDS does not support
impersonation of the current user for object processing (for example, cubes). If you attempt
to process using this authentication mode, you’ll receive an impersonation mode error during
any processing attempts. Remember that once you move to production, you can choose to
automate processing via XMLA scripting or by using SSMS.

After you’ve created and configured your connections to data sources, you’ll next want to
create data source views. We’ll define this object type, discuss the reasons for its existence,
and then look at the mechanics of creating a data source view.

Data Source Views


A data source view (DSV) is an SSAS object that represents some type of view of the data that
you’ve created a data source (or connection) to. If your environment is simple and small—for
example, your source data originates from a single RDBMS, probably SQL Server, and you're
the administrator of that server as well as of the server where SSAS is running—the purpose of the
DSV object will not be obvious to you.

It’s only when you consider larger BI projects, particularly those of enterprise size, that the
reason for DSVs becomes clear. A DSV allows you to define some subset of your source data
as data that is available for loading into OLAP cubes, data mining structures, or both.

If you own (administer) the source data, it's likely that you've already created views (saved
Transact-SQL queries) in the source system to prepare the data for loading. If that's the
case, you can simply reference these saved queries rather than the underlying base tables.

However, if you have limited permissions to the source systems (such as read-only access), as
will generally be the case in larger BI implementations, DSVs allow you to pull in information
in its original format and then perform shaping via the SSAS service. This shaping can include
renaming source tables and columns, adding and removing relationships, and much more.

Let’s start by examining the DSVs that are part of the Adventure Works DW 2008 sample. To
do this, navigate to the Data Source View folder in Solution Explorer and then double-click
Adventure Works DW to open it in the designer. You might need to click on the first entry
in the Diagram Organizer (<All Tables>) in the top left of the workspace to have the tables
displayed on the designer surface. We adjusted the zoom to 25 percent to get all the tables
to fit in the view. Figure 7-8 shows the result.

Figure 7-8 The data source view for the Adventure Works DW 2008 sample is complex.

This is the point where SSAS developers begin to see the complexity of an OLAP cube. You’ll
remember that a correctly designed cube is based on a series of source star schemas, and that
those star schemas have been created based on validated grain statements (for example,
"We want to view sales amount by each product, by each day, by each customer, and so on").
This flattened visualization of the entire cube is less than optimal for learning.

We recommend that you examine a single source star schema. That is the purpose of the
Diagram Organizer section at the top left of the workspace. To start, click on the Internet
Sales entry in the Diagram Organizer section.

In addition to being able to visualize the star schemas—that is, tables and relationships—the
DSV includes multiple viewers that are designed to help you visualize the source data. If you
remember that the assumption is that you might have limited access either to the source
data or to its included query tools, the inclusion of these viewers makes sense. To access
these viewers—which include a table, pivot table, chart, and pivot chart view of the data—
you simply right-click on any table (or view) that is displayed on the DSV designer surface (or
in the Tables list shown in the metadata browser to the left of the designer surface) and then
click Explore Data.

After you’ve done that, if you’ve chosen a pivot table or pivot chart, you can manipulate
the attribute values being shown. In the case of a pivot chart, you can also select the for-
mat of the chart. We find that this ability to quickly take a look at source data can be a real
time saver when attempting to perform an informal validation of data, particularly for quick
prototypes built with samples of actual data. Figure 7-9 shows a view of the data from the
Products table shown in a chart view.

Figure 7-9 DSVs include pivot chart source data browsers.

In addition to viewing the source data, you can actually make changes to the metadata that
will be loaded into your SSAS objects. You do this by changing metadata—that is, renaming
tables, renaming columns, changing relationships, and so on—and by defining new columns
and tables. New calculations are called named calculations. They are basically calculated col-
umns. New tables are called named queries. Again, the thinking is that you’ll use this feature
if you can’t or don’t want to change source data. To add a named query in a particular DSV,
right-click in an empty area on the designer surface and choose New Named Query. This will
look quite familiar to SQL Server developers and administrators if SQL Server is the source
RDBMS for your DSV. That’s because this is an identical dialog box to the query designer that
you see in SSMS when connected to SQL Server. Figure 7-10 shows this dialog box.

This functionality allows you to create a view of one or more source tables or views to be
added to your DSV. You can see in Figure 7-10 that we’ve selected multiple tables and, just
like in the RDBMS, when we did that, the query design tool automatically wrote the Transact-
SQL join query for us.

Figure 7-10 Named queries can be added to DSVs.

In addition to making changes to the DSV by adding named queries, which, in essence, pro-
duce new virtual tables, you can also make changes to the existing tables by adding virtual
columns. This type of column is called a named calculation. To create one, right-click the name
of the table you want to affect in the DSV designer and then click New Named Calculation.
A dialog box named Create Named Calculation will open on the
DSV designer surface. You can then enter information into the dialog box. Unlike the Named
Query designer, you have to type queries into the Create Named Calculation dialog box
without any prompting. These queries must be written in syntax that the source system can
understand.

Figure 7-11 shows the dialog box for a sample named calculation that is part of the Product
table in the Adventure Works DW 2008 sample. Columns created via named calculations are
shown in the source table with a small, blue calculator icon next to them.
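
For example, a named calculation that derives a friendly product line name from a ProductLine code column might use an ordinary Transact-SQL expression such as the following. This is a hypothetical illustration; the expression shipped in the sample may differ:

CASE ProductLine
    WHEN 'M' THEN 'Mountain'
    WHEN 'R' THEN 'Road'
    WHEN 'T' THEN 'Touring'
    ELSE 'Other'
END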

Now that we’ve explored the mechanics of how to create DSVs and what you can see and do
(to make changes) in DSVs, let’s talk a bit about best practices for DSV creation. It’s with this
object that we’ve seen a number of customers begin to go wrong when working in SSAS.

Figure 7-11 Named calculations can be added to tables in the DSV.

As we mentioned, the DSVs are the basis for creating both OLAP cubes and DM structures. In
particular, the OLAP cube designer in BIDS expects a star schema source structure from DSVs.
The more closely you comply with this expectation in your creation of your DSV, the more
easily and quickly you’ll be able to build cubes that perform well. We covered dimensional
modeling extensively in Chapter 5, “Logical OLAP Design Concepts for Architects.” If you’re
new to OLAP, you might want to review that information now.

In a nutshell, each star schema should have at least one fact table and some related dimen-
sion tables. Dimension tables are denormalized structures, typically containing many columns
(that is, wide tables), each describing one entity (for example, customer, product, or date).
A dimension table should contain the original, source primary key and a newly generated
unique primary key, as well as attribute information such as name, status, color, and so on.
Individual dimensions are sourced (usually) from a single table. This is called a star design.
Dimension tables can originate from multiple, related tables. This is called a snowflake design.
An example of a snowflake dimension source is the group of Product, ProductSubcategory,
and ProductCategory tables. There should be a business justification for snowflaking source
dimension tables. An example of a business reason is that values are changeable in one
source dimension table and not in another related one.

Fact tables should be narrow (or contain few columns). They should contain foreign keys
relating each row to one or more dimension table–type rows. Fact tables should also contain
fact values. Fact values are sometimes called measures. Facts are usually numeric and most
often additive. Some examples are OrderQuantity, UnitPrice, and DiscountAmount.
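
To make this concrete, here is a minimal Transact-SQL sketch of one such star. The table and column names are illustrative only and are not taken from the sample database:

-- Dimension table: wide and denormalized, with a new surrogate key plus the source key
CREATE TABLE DimProduct (
    ProductKey      int IDENTITY(1,1) PRIMARY KEY,  -- newly generated unique key
    SourceProductID int NOT NULL,                   -- original source primary key
    ProductName     nvarchar(50) NOT NULL,
    Color           nvarchar(15) NULL,
    Status          nvarchar(10) NULL
);

-- Fact table: narrow, with foreign keys to dimension rows plus numeric, additive facts
CREATE TABLE FactInternetSales (
    ProductKey     int NOT NULL REFERENCES DimProduct (ProductKey),
    OrderDateKey   int NOT NULL,   -- would reference a date dimension table
    CustomerKey    int NOT NULL,   -- would reference a customer dimension table
    OrderQuantity  smallint NOT NULL,
    UnitPrice      money NOT NULL,
    DiscountAmount money NOT NULL
);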

A common mistake we’ve seen is for SSAS developers to simply make a DSV of an existing
RDBMS without giving any consideration to OLAP modeling. Don’t do this! An OLAP cube is
a huge, single structure intended to support aggregation and read-only queries. Although
the SSAS engine is very fast and well optimized, it simply can’t perform magic on normalized
source data. The resultant cube will most often be difficult to use (and understand) for end
users and often will be unacceptably slow to work with.

If you’re new to OLAP modeling, spend some time looking at the DSV associated with the
Adventure Works DW 2008 sample. Then follow our earlier advice regarding a design-first
approach—that is, write solid grain statements, design an empty OLAP destination structure,
and then map source data to the destination structure. Finally, use the SSIS tools to perform
data extract, transform, and load (as well as cleansing and validation!) and then materialize or
populate the destination structure on disk.

Do all of this prior to creating your DSV. We understand the effort involved. As mentioned,
the extract, transform, and load (ETL) process can be more than 50 percent of the initial BI
project’s time; however, we haven’t found any way around this upfront cost. In addition,
cleaning and validating data is extremely important to the integrity of your BI solution. This
is time well spent. The purpose of your BI project is to make timely, correct, validated data
available in an easy-to-understand and quick-to-query fashion. You simply can’t skip the
OLAP modeling phase. We've seen this mistake made repeatedly and cannot overemphasize
the importance of correct modeling prior to beginning development.

Later in this chapter, we’ll begin the cube and mining structure building process. First we’ve
got a few more common items to review in BIDS. The first of these is the Role container.

Roles in Analysis Services


It’s important that you understand that the only user with access to SSAS objects by default
is the SSAS administrative user or users assigned during setup. In other words, the computer
administrator or the SQL administrator do not get access automatically. This is by design. This
architecture supports the “secure by default” paradigm that is simply best practice. To give
other users access, you’ll need to create roles in SSAS. To do this, you simply right-click on the
Roles container in BIDS and choose New Role. The interface is easy to use. Figure 7-12 shows
the design window for roles.

Figure 7-12 SSAS roles allow you to enable access to objects for users other than the administrator.

The key to working with the role designer is to review all the tabs that are available in BIDS.
You can see from Figure 7-12 that you have the following tabs to work with:

■■ General Allows you to set SSAS database-level permissions. Remember that a data-
base in BIDS includes the data source and DSV, so it includes all objects—that is, all
cubes, all mining structures, and so on—associated with this particular project. You can
(and will) set permissions more granularly if you create multiple SSAS objects in the
same SSAS project by using the other tabs in the role designer.
■■ Membership Allows you to associate Windows groups or users with this role. The
Windows users or groups must already exist on the local machine or in Active Directory
if your SSAS server is part of a domain.
■■ Data Sources Allows you to assign permissions to specific data sources.
■■ Cubes Allows you to assign permissions to specific cubes.
■■ Cell Data Allows you to assign permission to specific cells in particular cubes.
■■ Dimensions Allows you to assign permissions to specific dimensions.
■■ Dimension Data Allows you to assign permissions to specific dimension members.
■■ Mining Structures Allows you to assign permissions to specific mining structures.

You’ll also want to take note of the tabs for the Roles object. When you select a particular
tab, you’re presented with an associated configuration page where you can set permissions
and other security options (such as setting the default viewable member) for the particular
object that you’re securing via the role.

Permission types vary by object type. For example, OLAP cubes with drillthrough (to source
data) enabled require that you assign drillthrough permission to users who will execute
drillthrough queries. You can also change default members that are displayed by user for
cube measures and for particular dimensions and dimensional hierarchies. After we’ve
reviewed and created the core SSAS objects—that is, cubes, dimensions, and mining struc-
tures—we’ll return to the topic of specific object permission types.

Using Compiled Assemblies with Analysis Services Objects


As with the SQL Server 2008 RDBMS, you can write .NET assemblies in a .NET language, such
as C# or Visual Basic .NET, which will extend the functionality available to a particular SSAS
instance or database. You can also write assemblies as COM libraries.

You can create types and functions using .NET languages, and then associate these com-
piled assemblies with SSAS objects. An example of this type of extension is a function that
you write to perform some task that is common to your BI project. There are some examples
on CodePlex (http://www.codeplex.com), including a project named Analysis Services Stored
Procedure Project. One assembly from this CodePlex project is called Parallel. It contains two
functions: ParallelUnion and ParallelGenerate. These functions allow two or more set opera-
tions to be executed in parallel to improve query performance for calculation-heavy queries
on multiprocessor servers.

After you write an assembly, you must associate it with an SSAS server or database instance.
To associate an assembly with the SSAS server, you can either use SSMS or BIDS. If you’re
using BIDS, you associate an assembly with an SSAS database instance by right-clicking the
Assembly folder in BIDS and then configuring the code access security permissions (for .NET
assemblies only) and the security context information via the properties pane after you
define the path to the assembly. In SSMS, the dialog box to associate and configure assem-
blies contains settings for you to configure the Code Access Security (CAS) and the security
context (impersonation) for the assembly. Figure 7-13 shows the dialog box from SSMS.

Figure 7-13 Custom assemblies allow you to add custom logic to SSAS.

Note that four assemblies are associated by default with an SSAS server instance: ExcelMDX,
System, VBAMDX, and VBAMDXINTERNAL. What is interesting is that the MDX core query
library is implemented via these assemblies. The MDX function library bears a strong resem-
blance to the Microsoft Office Excel function library. This is, of course, by design because
functions you use with SSAS objects are used for calculation. The difference is the structure
of the target data source.

Creating custom assemblies is an advanced topic. Across all of our BI projects, we've used custom
assemblies with only one client. We recommend that if you plan to implement this approach
you thoroughly review all the samples on the CodePlex site first. For more information, see
the SQL Server Books Online topic, “Assemblies (Analysis Services – Multidimensional Data).”

Building OLAP Cubes in BIDS


We’re ready now to move to a core area of BIDS—the cube designer. We’ve got a couple
of items to review before launching into building our first OLAP cube. We’ll use our two
instances of BIDS to look at the development environment in two situations. You’ll remember
that the first instance is a disconnected, blank environment and the second is working with
an existing, connected cube. We’ll also talk about the uses of the Cube Wizard. Surprisingly, it
has been designed to do more than simply assist you with the OLAP cube-building process.

To understand this, we’ll start by right-clicking on the Cubes folder in Solution Explorer for
the BIDS instance that is blank (the one that is disconnected). Choosing the New Cube option
opens the Cube Wizard. The first page of the wizard is purely informational, and selecting
Next takes you to the next page. Note that this page of the wizard, shown in Figure 7-14, has
three options available for building an OLAP cube:

■■ Use Existing Tables


■■ Create An Empty Cube
■■ Generate Tables In The Data Source (with the additional option of basing this cube on
any available source XMLA templates)

Figure 7-14 To create a cube based on existing tables, you must first create a DSV for that data source.

You might be wondering why the option that you’d probably want to use at this point to cre-
ate a new cube—that is, Use Existing Tables—is grayed out and not available. This is because
we have not yet defined a DSV in this project. As we mentioned, if you have administrative
permissions on the source servers for your cube, it might not be obvious to you that you
need to create a DSV because you can just make any changes you want to directly in that
source data. These changes can include adding defined subsets of data, adding calculated
columns, and so on. And these changes are usually implemented as relational views.

As mentioned, DSVs exist to support SSAS developers who do not have permission to create
new items directly in source data. Whether you do or don’t, you should know that a DSV is
required for you to build an OLAP cube, which will use data from your defined data source as
part of your OLAP cube.

To define a DSV, right-click that folder in Solution Explorer, click New Data Source View,
select an existing data source, add all the appropriate tables from the source, and then
complete the wizard. This makes the tables and views available to your DSV. After you create
a DSV and rerun the Cube Wizard, the Use Existing Tables option in the Cube Wizard becomes
available. This is the usual process you'll take to create production cubes.

Before we create a cube based on existing tables, however, let’s first take a minute to under-
stand what the other two options in the Cube Wizard do. You can use the Create An Empty
Cube option in two different ways. First, if you create an empty cube and do not base it on
a DSV, you can later associate only existing dimensions from another database with this new
cube. These dimensions are called linked dimensions. Second, if you create an empty cube
and do base it on a DSV, you can create new measures based on columns from fact tables
referenced by the DSV. As with the first case, you can associate only existing dimensions with
this newly defined cube. The purpose of the Create An Empty Cube option is for you to be
able to create new cubes (by creating new measures) and then associate those measures with
existing dimensions in a project. So why would you want to make more than one cube in a
project? One reason is to perform quick prototyping of new cubes. We’ll explore the answer
to that question in more depth later.

The Generate Tables In The Data Source option also has two different methods of executing.
You can choose either not to use template files as a basis for your cube, or you can base your
cube on source templates. SSAS ships with sample source templates (based on the Adventure
Works DW 2008 sample OLAP cube structure) for both editions: Enterprise and Standard.
These templates contain the XMLA structural metadata that defines cube measures and
dimensions. You can use this as a basis and then modify the metadata as needed for your
particular BI project. These template files are located by default at C:\Program Files\Microsoft
SQL Server\100\Tools\Templates\olap\1033\Cube Templates.

Note On x64 systems, substitute “Program Files (x86)” for “Program Files” in the path referenced
above.

Figure 7-15 shows one of the pages that the wizard presents if you choose to use a template.
On the Define New Measures page of the Cube Wizard, you can select measures defined in the
template, create new measures, or both. This wizard also includes a similar page that allows you
to select existing dimensions, create new dimensions, or both. On the last page of the wizard,
you are presented with the Generate Schema Now option. When selected, this option allows
you to generate source RDBMS code so that you can quickly create a star schema structure
into which original source data can be loaded through the ETL process. In the case of a SQL
Server source, Transact-SQL data definition language (DDL) code is generated. If you select
the Generate Schema Now option, the Generate Schema Wizard opens after you click Finish in
the Cube Wizard. You’ll then be presented with the option of generating the new schema into a
new or existing DSV.

Figure 7-15 Using the Cube Wizard with an included template

So why would you select the Generate Tables In The Data Source option? It allows you to
model a cube using BIDS without associating it to a data source. In other words, you can use
BIDS as a design environment for OLAP modeling. As mentioned earlier in this book, using
BIDS in this way assumes that you have a thorough understanding of OLAP modeling con-
cepts. If you do have this understanding, using BIDS in this way can facilitate quick construc-
tion of empty prototype star schemas (or DSVs). We sometimes use this method (BIDS, create
cube, with no associated DSV) to create empty star schema tables during the prototyping
phase of our BI projects rather than using a more generic database modeling tool such as
Visio.

At this point in our exploration, we’d rather create a cube based on a data source. Be aware
that the sample Adventure Works DW 2008 is modeled in a traditional star-schema-like way.
We’ll use this as a teaching tool because it shows the BIDS Cube Wizard in its best light. A
rule of thumb to facilitate rapid development is that you start with a star schema source, as
much as is practical, and then deviate from it when business requirements justify doing so.
We’ll follow this best practice in our learning path as well.

Create a quick DSV in the disconnected instance by right-clicking on the data source view
container, setting up a connection to the AdventureWorksDW2008 relational database, and
then selecting all tables and views without making any adjustments. Next double-click the
newly created DSV and review the tables and views that have been added. As we proceed on
our OLAP cube creation journey, we’ll first review the sample OLAP cube and then build simi-
lar objects ourselves in our second BIDS instance.

Examining the Sample Cube in Adventure Works


To get started, double-click the Adventure Works cube in Solution Explorer to open the cube
designer in BIDS. As you’ve seen previously for some other SSAS objects, such as roles, open-
ing any BIDS designer reveals a wealth of tabs. We’ve shown the cube-related tabs in Figure
7-16. These tabs are named as follows: Cube Structure, Dimension Usage, Calculations, KPIs,
Actions, Partitions, Aggregations, Perspectives, Translations, and Browser.

Figure 7-16 The available tabs in the cube designer

The only cube tab we’ve looked at previously is the Browser tab. You might recall from our
discussions in an earlier chapter that the Browser tab options serve as a type of pivot table
control. The Browser tab items allow you to view your OLAP cube in an end-user-like envi-
ronment. For the remainder of this chapter, as well as for future chapters, we’ll explore each
of these tabs in great detail.

You’ll note also that there is an embedded toolbar below each tab. In addition, the designer
surface below each tab contains shortcut (right-click) menus in many areas. You can, of
course, always use the main menus in BIDS; however, we rarely do so in production, prefer-
ring to use the embedded toolbars and the internal shortcut menus. This might seem like a
picky point, but we’ve found that using BIDS in this way really improves our productivity.

Let’s take a closer look at the Cube Structure tab. (See Figure 7-17.) To the left, you’ll see a
metadata browser, which is similar to the one you saw when executing MDX queries in SSMS.
It includes a section for the cube measures and another one for the cube dimensions.

Figure 7-17 The BIDS Cube Structure tab contains a metadata browser.

Confusingly, the designer surface is labeled Data Source View. Weren’t we just working in
a separate designer for that object? Yes, we were. Here’s the difference. In the previous
DSV designer, you selected tables, columns, and relationships to be made available from
source data for loading into an OLAP cube destination structure. In the Cube Structure tab
Data Source View designer, you can review the results of that data load. You can also make
changes to the measures and dimensions on the cube side.

We’ll give you a concrete example to help clarify. Rows from the fact source tables become
measures in the OLAP cube. Fact rows can originate in one of two ways. The first way is to
do a straight data load by simply copying each row of data from the source fact table. The
second way is to originate it as a derived fact. This is a calculation that is applied at the time
of cube load, based on a query language that the source RDBMS understands. So, if you’re
using SQL Server, a Transact-SQL query can be used to create a calculated value. This type of
action is defined in the DSV. The calculation is performed when the source data is loaded—
that is, when it is copied and processed into the destination OLAP cube structure and the
resultant calculated values are stored on disk in the OLAP cube.

You also have the option of creating a calculated measure on the OLAP cube side. This is
defined using MDX and is calculated at OLAP query time and not stored in the OLAP cube.
You use the DSV area on the Cube Structure tab to create regular measures (that is, measures
based on any column in any fact table in the referenced DSV) or calculated measures.

Calculated measures are quite common in cubes. The SSAS engine is optimized to process
them quickly at query time. They are used when you have measures that will be needed by a
minority of users and when you want to keep space used on disk to a minimum. You’ll often
have 50 or more calculated measures in a single production cube. Calculated measures are
indicated in the metadata tree by a small, blue square with the function notation (that is, fx)
on the measure icon. This is shown in Figure 7-18.

Figure 7-18 Calculated measures are created using MDX expressions and are indicated by the fx notation.
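
For instance, a calculated measure added to the cube's MDX script might look like the following sketch. The base measure names come from the Adventure Works sample, but this particular calculation is our own illustration, not one shipped with the cube:

CREATE MEMBER CURRENTCUBE.[Measures].[Internet Gross Margin Pct]
    AS ([Measures].[Internet Sales Amount] - [Measures].[Internet Total Product Cost])
       / [Measures].[Internet Sales Amount],
    FORMAT_STRING = 'Percent',
    VISIBLE = 1;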

When you’re working on the designer surface in the cube designer, if you want to make any
structural changes to the underlying DSV, you can. You simply right-click on any table that
you’ve added to this view, click Edit Data Source View to open the DSV editor, and make your
changes there. So you can think of the data source view area in the cube designer as a viewer
rather than an editor. This chained editing is common throughout BIDS. Because of this, it
can be quite easy to get lost at first! We advise you to start out slowly. Once you get the
hang of it, chained editing will make sense and save you time.

You’ll also see the chained editing paradigm repeated in the lower half of the metadata
browser, the Dimensions section. Here you can see the dimension names, hierarchy names,
and level names. However, you cannot browse the dimensional data or edit the dimension
structure. So what is the purpose of listing the dimension information here if you can’t make
any structural changes? And just where do you make structural changes?

The configuration options available to dimensions, hierarchies, and attributes (levels) on
the Cube Structure tab are limited to specific options regarding how the cube will use that
dimension, some of which have an effect on processing. We haven't begun to talk about
cube processing yet, so we'll cover these options in detail later in this book.

So where do you develop dimensions? You use the dimension editor. As you might have
guessed by now, you can take a shortcut directly to the dimension editor from the Cube
Structure tab. In the Dimensions section, click the link in the particular dimension you want to
edit. In Figure 7-19, we’ve opened the Customer dimension to show the Edit Customer link, as
well as the Customer Geography hierarchy and the Attributes container.

Figure 7-19 The Dimensions section on the Cube Structure tab contains links to the dimension editor.

Expand the Customer dimension, and click the Edit Customer link in the Dimensions sec-
tion to open the dimension editor workspace in BIDS. We’ll now take a slight detour into the
world of dimensions. After we complete this, we’ll return to the cube designer.

Understanding Dimensions
So far, we’ve addressed the underlying OLAP cube as a single star schema—that is, fact tables
plus dimension tables. This is a bit of an oversimplification. Although it’s possible to base an
OLAP cube on such a simple structure, in our experience business requirements and source
data often introduce complexities. One of those complexities is the need for distinct sets of
permissions or processing settings at the cube level. If this is unavoidable (and justified by
business requirements), you can create multiple OLAP cubes. Because this
situation is common, we usually begin our cube design with dimensions and then proceed to
measures. Our goal is always to create a single cube. If that is not practical, the shared nature
of dimensions is quite useful.

Let’s provide a business example to give some context to this discussion. Suppose that you
have requirements to create OLAP solutions for two distinct end-user communities for a
retail chain. These communities are financial analysts and store management. The complex-
ity of what you choose to present will vary greatly for each group. Although you could create
views of a single cube (called perspectives), you might also have different data update fre-
quency requirements. For example, the analysts might require data current as of the previous
day, and managers might require data current as of the previous hour. As a result of these
varying requirements (and, often, as a result of other requirements beyond these), you elect
to create two separate cubes. Rather than starting with fact tables and measures, you should
start by attempting to create common dimensions. Examples of such often include Time,
Customers, and Products dimensions.

Tip As a rule of thumb, when designing and building cubes, start by building the dimensions.

Given this example, let’s now examine the dimension editor in BIDS. We’ve opened the
Customer dimension from the Adventure Works DW 2008 cube sample for our discussion. As
with the cube designer, when you open the dimension editor, you see the tab and embedded
toolbar structure that is common to BIDS. We show this in Figure 7-20. The tab names here
are Dimension Structure, Attribute Relationships, Translations, and Browser. As with the cube
designer, the only tab we’ve examined to this point in the process is the Browser tab.

Figure 7-20 The Dimension Structure metadata tab in the dimension editor.

The Dimension Structure tab contains three sections: Attributes, Hierarchies, and Data
Source View. As with the Cube Structure tab, the Data Source View section on the Dimension
Structure tab lets you view only the source table or tables that were used as a basis for creat-
ing this dimension. If you want to make any changes in those tables (for example, adding cal-
culated columns) you right-click on any table in that section and then click Edit Data Source
View, which opens that table in the main BIDS DSV editor.

The Attributes section shows you a list of defined attributes for the Customer dimen-
sion. Note that these attributes mostly have the same names as the source columns from
the Customer table. In this section, you can view and configure properties of dimensional
attributes.

As you build a dimension, you might notice squiggly lines underneath some attributes. This is
the new Analysis Management Objects (AMO) design warning system in action. Although you
can’t see color in a black-and-white book, these lines are blue when viewed on your screen.
If you hover your mouse over them, the particular design rule that is being violated appears
in a tooltip. As mentioned earlier, these warnings are for guidance only; you can choose to
ignore them if you want. Microsoft added these warnings because many customers failed
to follow best OLAP design practices when building cubes in SSAS 2005, which resulted in
cubes that had suboptimal performance in production environments.

During our discussion of DSVs, we mentioned a new design approach in BIDS 2008—one
based on exclusivity rather than inclusivity. This approach has also been applied to dimension

design. Although you’ll still use a wizard to create dimensions, that wizard will reference
only selected source columns as attributes. In the past, the wizard automatically referenced
all source columns and also attempted to auto-detect hierarchies (or summary groupings)
of these attributes. You must now manually create all hierarchies and relationships between
attributes in those hierarchies.

Attribute hierarchies were discussed earlier; however, some information bears repeating here.
Missing and improper hierarchy definitions caused poor performance in many cubes built
using SSAS 2005.

Attribute Hierarchies
The design of rollup (dimension attribute hierarchy) structures should be driven by business
requirements. That is, you build structures that reflect answers to the question, “How do
you roll up your data?” Time is probably the simplest dimension to use as an example. After
determining the lowest level of granularity needed—days, hours, minutes, and so on—
the next question to ask is, “How should the information be rolled up?” Again, using the Time
dimension, the answer is expressed in terms of hours, days, weeks, months, and so on.

Creating appropriate attribute hierarchies makes your cubes more usable in a number of
ways. First, end users will understand the data presented and find it more useful. Second, if
the data is properly designed and optimized (more about that shortly), cube query perfor-
mance will be faster. Third, cube processing will be faster, which will make the cube available
more frequently.

SSAS supports two types of attribute hierarchies: navigational and natural. What do these
terms mean, and what are the differences between them? Navigational hierarchies can be
created between any attributes in a dimension. The data underlying these attributes need
not have any relationship to one another. These hierarchies are created to make the end
user’s browsing more convenient. For example, you can design the Customer dimension in
your cube to be browsed by gender and then by number of children.

Natural hierarchies are also created to facilitate end-user browsing. The difference between
these and navigational hierarchies is that in natural hierarchies the underlying data does have
a hierarchical relationship based on common attribute values. These relationships are imple-
mented through attribute relationships, which are discussed in the next section. An example
of this is in our Date dimension—months have date values, quarters have month values, and
so on.
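
As a quick illustration, the following MDX query is a sketch only; it assumes the Adventure
Works sample's [Date].[Calendar] hierarchy, the [Measures].[Internet Sales Amount] measure,
and the &[2004] year key. It uses the natural Calendar hierarchy to roll daily facts up to the
month level for a single year:

-- Internet sales for each month of one calendar year, rolled up
-- via the Date dimension's natural Calendar hierarchy
SELECT
    [Measures].[Internet Sales Amount] ON COLUMNS,
    DESCENDANTS
    (
        [Date].[Calendar].[Calendar Year].&[2004],  -- member key assumed
        [Date].[Calendar].[Month]
    ) ON ROWS
FROM [Adventure Works];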

Because of the importance of this topic and because many customers weren’t successful
using BIDS 2005, Microsoft has redesigned many parts of the dimension editor. One place
you’ll see this is in the Hierarchies section of the Dimension Structure tab of the editor, which
is shown in Figure 7-21. Notice that the Date dimension has four hierarchies defined, two of

which are visible in Figure 7-21: Fiscal and Calendar. Creating more than one date hierarchy is
a common strategy in production cubes.

Figure 7-21 The dimension editor Hierarchies section lists all hierarchies for a particular dimension.

The areas where you work to create the different types of hierarchies have changed in BIDS 2008.
There are two types of hierarchies: navigational (where you can relate any attributes) and
natural (where you relate attributes that have an actual relationship in the source data, usu-
ally one-to-many). In BIDS 2005, you created and configured hierarchies in the Hierarchies
section and attribute relationships in the Attributes section. In BIDS 2008, you still create
hierarchies here by dragging and dropping attribute values from the Attributes section. You
can also rename and configure some other properties. However, Microsoft has created a
new attribute relationship designer in the dimension editor to help you to visualize and build
effective attribute relationships between values in your dimensional hierarchies.

Attribute Relationships
Before we examine the new attribute relationship designer, let’s take a minute to define
the term attribute relationship. We know that attributes represent aspects of dimensions.
In the example we used earlier of the Date dimension (illustrated by Figure 7-21), we saw
that we have values such as Date, Month Name, Fiscal Quarter, Calendar Quarter, and so
on. Hierarchies are roll-up groupings of one or more attributes. Most often, measure data is
aggregated (usually summed) in hierarchies. In other words, sales amount can be aggregated
by day, then by month, and then by fiscal quarter.

Measure data is loaded into the cube via rows in the fact table. These are loaded at the low-
est level of granularity. In this example, that would be by day. It’s important to understand
that the SSAS query processing engine is designed to use or calculate aggregate values of
measure data at intermediate levels in dimensional hierarchies. For example, it’s possible that
an MDX query to the Fiscal Quarter level for Sales Amount could use stored or calculated
aggregations from the Month level.
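
As a sketch (again assuming the sample's [Date].[Fiscal] hierarchy and [Measures].[Sales Amount]
measure), a query that asks only for quarter-level values looks like the following; the engine can
answer it from stored or derived aggregations rather than rescanning every fact row:

SELECT
    [Measures].[Sales Amount] ON COLUMNS,
    [Date].[Fiscal].[Fiscal Quarter].Members ON ROWS
FROM [Adventure Works];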

If you’re creating natural hierarchies, the SSAS query engine can use intermediate aggrega-
tions if and only if you define the attribute relationships between the level members. These
intermediate aggregations can significantly speed up MDX queries to the dimension. To that
end, the new Attribute Relationships tab in the dimension editor lets you visualize and con-
figure these important relationships correctly. Figure 7-22 shows this designer for the Date
dimension.

Figure 7-22 The Attribute Relationships tab is new in BIDS 2008.

The main designer shows you the attribute relationship between the various levels in the
defined hierarchies. The bottom left section lists all defined attributes for this dimension. The
bottom right section lists all relationships. Attribute relationships have two possible states:
flexible or rigid. Flexible relationships are indicated by an open arrow (outlined), and rigid
relationships are indicated by a solid black arrow.

The state of the attribute relationship affects how SSAS creates and stores aggregations when
the dimension is processed. Dimension processing means that data is loaded from source
locations into the dimension destination structure. If you define a relationship as rigid, previ-
ously calculated aggregations are retained during subsequent dimension processing. This
will, of course, speed up dimension processing. You should only do that when you do not
expect dimension data to change. Date information is a good example of rigid data, and
you’ll note that all relationships have been defined as rigid in all hierarchies. In other words,
if you never expect to update or delete values from source data, as would be the case in a
date hierarchy between, for example, month and year names, you should define the attribute
relationship as rigid. Inserting new data is not considered a change in this case, only updat-
ing or deleting data. On the other hand, if data could be updated or deleted—for example, in
the case of customer surnames (women getting married and changing their surnames)—you
should define the attribute relationship as flexible.

For more information about this topic, see the “Defining Attribute Relationships” topic in SQL
Server Books Online.

Modeling attribute relationship types correctly also relates back to dimensional data model-
ing. As you might recall, we discussed the importance of capturing business requirements
regarding whether or not changes to dimension data should be allowed, captured, or both.
As mentioned, an example of dimension data that is frequently set to changing or flexible is
the Last Name attribute (for customers, employees, and so on). People, particularly women,
change their last names for various reasons, such as marriage and divorce.

Translations
The Translations tab allows you to provide localized strings for the dimension metadata. In
Figure 7-23, we’ve collapsed the individual Attributes localizations so that you can see that
you can provide localized labels for defined attribute hierarchy level names as well. We’ll talk
a bit more about localization of both dimension and measure data in Chapter 9, “Processing
Cubes and Dimensions.” Keep in mind that what you’re doing in this section is localizing
metadata only—that is, labels. BIDS actually has some nifty built-in wizards to facilitate data
localization (particularly of currency-based measures). We’ll cover that topic in the next
chapter.

Figure 7-23 The Translations tab allows you to provide localized labels for dimension metadata.

Also, note that you can preview your dimension localization on the Browser tab by selecting
one of the localization types in the Language drop-down menu. In our case, we’ve selected
Spanish, as shown in Figure 7-24.

Figure 7-24 The dimension editor Browser tab allows you to preview any configured localizations.

Now that we’ve taken a first look at dimension building, we’ll return to our tour of the sample
OLAP cube. Before we do so, you might want to review the rest of the dimensions used in
the Adventure Works DW 2008 sample SSAS project. Nearly all attribute hierarchy relation-
ship types are represented in the sample. Reviewing this sample will also give you perspec-
tive before we move to the next part of our explanation, where we’ll look at how the defined
dimensions will be used in an OLAP cube. To do this, you double-click on the Adventure
Works sample cube in the Cubes folder in Solution Explorer to open the cube designer. Then
click the Dimension Usage tab. We’ll continue our tour there.

Using Dimensions
After you’ve created all the dimensions (including defining hierarchies and attribute rela-
tionships) you need to meet your business requirements, you’ll begin to combine those
dimensions with measures (derived from fact tables) to build OLAP cubes. Microsoft made a
major change to the possible cube structure in SSAS 2005. In our experience, most custom-
ers haven’t fully grasped the possibilities of this change, so we’ll take a bit of time to explain
it here.

In classic dimensional source modeling for OLAP cubes, the star schema consists of exactly
one fact table and many dimension tables. This was how early versions of Microsoft’s OLAP
tools worked as well. In other words, OLAP developers were limited to basing their cubes on
a single fact table. This design proved to be too rigid to be practical for many business sce-
narios. So starting with SSAS 2005, you can base a single cube on multiple fact tables that are
related to multiple dimension tables. This is really a challenge to visualize! One way to think
of it is as a series of flattened star schemas. Multiple dimension tables can be used by mul-
tiple cubes, with each cube possibly having multiple fact tables.

So how do you sort this out? Microsoft has provided you with the Dimension Usage tab in
the cube designer, and we believe it’s really well designed for the task. Before we explore the
options on the Dimension Usage tab, let’s talk a bit more about the concept of a measure
group. You might recall from the Cube Structure tab shown in Figure 7-16 that measures are
shown in measure groups. What exactly is a measure group?

Measure Groups
A measure group is a container for one or more measures. Measure groups are used to relate
groups of measures to particular dimensions in an OLAP cube. For this reason, all measures
common to a group need to share a common grain—that is, the same set of dimensions at
the same level. What that means is that if the measures Internet Sales Amount and Internet
Order Quantity both need to expose measure values at the “day” grain level for the Date
dimension, they can be placed in a common measure group. However, if Internet Sales
Amount needs to be shown by hours and Internet Order Quantity needs to be shown only
by days, you might not put them in the same measure group. Just to complicate the situation
further, you could still put both measures in the same group if you hide the hourly value for
the Internet Sales Amount measure! This would give them the same set of dimensions, at the
same level.
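
Because the two measures share a common grain, a single query can return them side by side.
Here's a minimal sketch, assuming the Adventure Works measure and hierarchy names:

SELECT
    {
        [Measures].[Internet Sales Amount],
        [Measures].[Internet Order Quantity]
    } ON COLUMNS,
    [Date].[Calendar].[Month].Members ON ROWS
FROM [Adventure Works];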

You might recall that a measure is created in one of three ways. It can be simply retrieved
from a column in the fact table. Or it can be derived when the cube is loaded via a query
to the data source (for example, for SQL Server via a Transact-SQL query). Or a measure can
be calculated at cube query time via an MDX query. The last option is called a calculated
measure. It’s common to have dozens or even hundreds of measures in a single cube. Rather
than forcing all that data into a single table, SSAS supports the derivation of measures from
multiple-source fact tables.
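
To illustrate the third option, the following sketch defines a calculated measure in the cube's
MDX script. The measure name we create here (and the gross margin calculation itself) is an
assumption for illustration, not part of the sample as shipped:

-- A calculated measure: evaluated at query time, not stored in the fact table
-- (divide-by-zero handling omitted for brevity)
CREATE MEMBER CURRENTCUBE.[Measures].[Internet Gross Margin Pct]
AS
    ( [Measures].[Internet Sales Amount] - [Measures].[Internet Total Product Cost] )
    / [Measures].[Internet Sales Amount],
FORMAT_STRING = 'Percent',
ASSOCIATED_MEASURE_GROUP = 'Internet Sales';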

Multiple-source fact tables are used mostly to facilitate easier cube maintenance (principally
updating). For example, if SQL Server 2008 source tables are used, an administrator can take
advantage of relational table partitioning to reduce the amount of data that has to be physi-
cally operated on during cube processing.

In the case of the Adventure Works cube, you’ll see by reviewing the source DSV in the
designer that seven fact tables are used as source tables. Three of these fact tables are cre-
ated via Transact-SQL queries. (Similar to relational views, the icon in the metadata viewers
indicates that the tables are calculated.) So there are four physical fact tables that underlie
the 11 measure groups in the Adventure Works cube.

Armed with this knowledge, let’s take a look at the Dimension Usage tab of the OLAP cube
designer. A portion of this is shown in Figure 7-25. The first thing to notice is that at the inter-
section of a particular dimension and measure group, there are three possibilities. The first is
that there is no relationship between that particular dimension’s members and the measures

in the particular measure group. This is indicated by a gray rectangle. An example of this
is the Reseller dimension and the Internet Sales measure group. This makes business sense
because at Adventure Works, resellers are not involved in Internet sales.

Figure 7-25 The Dimension Usage tab allows you to configure relationships between dimensions and
measure groups.

The next possibility is that there is a regular or star-type relationship between the dimen-
sion and measure data. This is indicated by a white rectangle at the intersection point with
the name of the dimension shown on the rectangle. An example of this is shown for the Date
dimension and the Internet Sales measure group.

The last possibility is that there is some type of relationship other than a standard star (or
single dimension table source) between the dimension and the measure group data. This is
indicated by a white rectangle with some sort of additional icon at the intersection point. An
example of this is at the intersection of the Sales Reason dimension and the Internet Sales
measure group.

To examine or configure the type of relationship between the dimension data and the mea-
sure group data, you click on the rectangle at the intersection point. After you do so, a small
gray build button appears on the right side of the rectangle. Click that to open the Define
Relationship dialog box. The dialog box options vary depending on the type of relationship
that you’ve selected. The type of relationship is described, and there is a graphic displayed to
help you visualize the possible relationship types.

The possible types of relationships are as follows:

■■ No Relationship Shown by gray square
■■ Regular Star (or single dimension table) source
■■ Fact Source column from a fact table
■■ Referenced Snowflake (or multiple dimension tables) as source
■■ Many-to-Many Multiple source tables, both fact and dimension, as source
■■ Data Mining Data mining model as source data for this dimension

Notice in Figure 7-26 that the Select Relationship Type option is set to Regular. As mentioned,
you also set the Granularity Attribute here. In this case, Date has been selected from the
drop-down list. Note also that the particular columns that create the relation are referenced.
In this case, DateKey is the new, unique primary key in the source DimDate dimension table
and OrderDateKey is the foreign key in the particular source fact table.

Figure 7-26 The Define Relationship dialog box allows you to verify and configure relationships between
dimensions and measures.

It’s interesting to note that after you click the Advanced button in the Define Relationship
dialog box, you’ll be presented with an additional dialog box that enables you to configure
null processing behavior for the attributes (levels) in the dimension. Note also that if you’ve
defined a multiple column source key for the attribute, this is reflected in the Relationship
section of this advanced dialog box as well, as shown in Figure 7-27.

Figure 7-27 The Measure Group Bindings dialog box allows you to define null processing behavior.

Your choices for controlling null processing behavior at the attribute level are as follows:

■■ Automatic (the default) Converts numeric nulls to 0 and string nulls to empty strings
for OLAP cubes, and follows the configured UnknownMember property behavior for
DM structures.
■■ Preserve Preserves the null value as null. (We do not recommend using this setting.)
■■ Error Attempts to load nulls generate an exception. If it’s a key value (or a primary key
from the source table), the configured value of NullKeyNotAllowed determines behavior
(possibilities are IgnoreError, ReportAndContinue, and ReportAndStop).
■■ UnknownMember Relies on a two-part property setting—the UnknownMember
visibility property (set by default to None, but configurable to Visible or Hidden), and
the UnknownMemberName property (set to the string Unknown by default).
■■ ZeroOrBlank Same as Automatic for OLAP cubes.

We would be remiss if we didn’t mention that it’s a much better practice to trap and elimi-
nate all nulls during your ETL process so that you don’t have to consider the possibility of
nulls while loading your dimensions and measures.

Beyond Star Dimensions


We hope that you’ve considered our advice and will base the majority of your dimensions on
a single source table so that you can use the simple star schema design (defined as Regular)
described earlier. If, however, you have business-justified reasons for basing your dimensions
on non-star designs, many of these varieties can be supported in SSAS. Let’s take a look at
them, in priority order of use.

Snowflake Dimension
The most common variation we see in production cubes is the snowflake design. As men-
tioned, this is usually implemented via a dimension that links to the fact table through
another dimension. To establish this type of relationship on the Dimension Usage tab of the
cube designer, you simply select the Referenced relationship type. After you do that, you
need to configure the dialog box shown in Figure 7-28. In the example shown, we’re defining
the relationship between the Geography dimension and the Reseller Sales measure group. To
do this, we choose to use an intermediate dimension table (Reseller).

Figure 7-28 The Define Relationship dialog box allows you to define referenced relationships.

The most common business reason for using separate tables deals with the changeability and
reuse of the data. In our example, geography information is static and will be reused by other
dimensions. Reseller information is very dynamic. So the update behavior is quite different
for these two source tables.

To create a referenced (or snowflake) dimension, you must model the two source tables with
a common key. In this case, the primary key of the Geography table is used as a foreign key
to the Reseller table. Of course, there must also be a key relationship between the Reseller
dimension table and the Reseller Sales fact table. That attribute (key value) is not shown
here. If you want to examine that relationship, you can either add all three tables to the Data
Source View section on the Cube Structure tab or open the DSV for the cube.

Finally, note that the Materialize option is selected, which is the default setting. This setting
tells the SSAS instance to persist any aggregations (that are designed during processing) to
disk when the dimension is processed. This option is on by default so that better MDX query
performance can be realized when the dimension is queried.
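
Once the referenced relationship is in place, queries can slice Reseller Sales by Geography just
as if the dimension were related directly to the fact table. Here is a minimal sketch, assuming
the sample's object names:

SELECT
    [Measures].[Reseller Sales Amount] ON COLUMNS,
    [Geography].[Geography].[Country].Members ON ROWS
FROM [Adventure Works];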

Fact Dimension
A fact dimension is based on an attribute or column from the fact table. The most common
business case is modeled in the sample—that is, order number. You simply set the relation-
ship type to Fact by selecting that value from the Select Relationship Type drop-down list in
the Define Relationship dialog box, and then select the source column from the selected fact
table. Although it’s easy to implement, this type of dimension can have a significant impact
on your cube performance, as the following paragraphs explain.

Usually fact tables are far larger (that is, they contain more rows) than dimension tables. The
reason for this should be obvious. Let’s take the example of sales for a retail store. A success-
ful business should have customers who make many repeat purchases, so there are relation-
ships of one customer to many purchases, or one row in a (customers) dimension table to
many rows in a (sales transactions) fact table.

For this reason, anything that makes your source fact tables wider—that is, adds columns to
add information about each order—should have a business justification in your model. As we
mentioned earlier, it’s a best practice to model your fact tables narrowly for best load and
query performance. To summarize, although you can easily use fact columns as dimensions,
do so sparingly.

Many-to-Many Dimension
The complex many-to-many option, which was added at the request of enterprise customers,
extends the flexibility of a model but can be tricky to use. In particular, strict adherence to
a source modeling pattern makes implementing this type much clearer. First, we’ll give you
the business case from Adventure Works DW 2008 as a reference. Shown in Figure 7-29 is the
configuration dialog box for this option.

Figure 7-29 The many-to-many relationship type is quite complex and should be used sparingly.

In the sample, the many-to-many relationship is used to model Sales Reasons for Internet
Sales. It works nicely for this business scenario. To better understand the required model-
ing, we’ll show the three source tables involved in this relationship in the DSV for the cube:
Internet Sales Facts, Internet Sales Reason Facts, and Sales Reason.

It's quite easy to see in Figure 7-30 that Internet Sales Reason Facts functions much like a
relational join or junction table. It contains only keys—a primary key and foreign keys for
both the Internet Sales Facts and Sales Reason tables. The join (or intermediate fact table as
it’s called here) establishes a many-to-many relationship between sales reasons and particular
instances of Internet sales. In plain English, the business case is such that there can be more
than one sales reason for each Internet sale, and these sales reasons are to be selected from a
finite set of possible reasons. We’ll cover the last type of dimension relationship, data mining,
in Chapter 13, “Implementing Data Mining Structures.”

Although we’ve not yet looked at all aspects of our sample cube, we do have enough infor-
mation to switch back to our disconnected, blank instance and build our first simple cube. In
doing that, we’ll finish this chapter. In the next chapter, we’ll return to reviewing our sample
to cover the rest of the tabs in the BIDS cube designer: Calculations, KPIs, Actions, Partitions,
Aggregations, Perspectives, and Translations. We’ll also cover the configuration of advanced
property values for SSAS objects in the next chapter. We realize that by now you’re probably
quite eager to build your first cube!

Figure 7-30 The many-to-many relationship type requires two source fact tables. One is a type
of junction table.
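
To see the effect of the many-to-many relationship, you can break Internet sales out by sales
reason, as in the following sketch (object names assumed from the sample). Because a single
sale can carry several reasons, the individual rows can legitimately sum to more than the
cube's grand total:

SELECT
    [Measures].[Internet Sales Amount] ON COLUMNS,
    [Sales Reason].[Sales Reason].[Sales Reason].Members ON ROWS
FROM [Adventure Works];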

Building Your First OLAP Cube


We’ll now work in the blank disconnected instance and create a cube based on the
AdventureWorksDW2008 relational source star schema and data source view that you’ve
already created. Although we took quite a bit of time to get to this point, you’ll probably be
surprised at how easy creating a cube using BIDS actually is. The interface is really quite easy
to use after you gain an understanding of OLAP concepts and modeling. This is why we’ve
taken such a long time to get to this point. Probably the biggest issue we saw with customers
using SSAS 2005 was overeagerness to start building coupled with a lack of OLAP knowledge.
This often produced undesirable results. In fact, we were called on more than one occasion
to fix deployed cubes. Most often these fixes were quite expensive. Usually, we started over
from scratch.

We’ve armed you with quite a bit of knowledge, so we expect your experience will be
much more pleasant and productive. We’ll launch the Cube Wizard from the Cube folder in
Solution Explorer by right-clicking on that folder. You should choose the Use Existing Tables
option from the DSV you created earlier.

Selecting Measure Groups


As you work with the Cube Wizard pages, you need to select measure group tables from the
list of tables included in the DSV first. If you’re unsure which tables to select, you can click the
Suggest button on the Select Measure Group Tables page. The results of clicking that button
are shown in Figure 7-31.

Figure 7-31 The Select Measure Group Tables page of the Cube Wizard contains a Suggest button.

Although the suggest process is fairly accurate, you’ll still want to review and adjust the
results. In this case, all source tables whose names include the word Fact, as well as most
of the views, were selected. We'll clear the check boxes for all the selected views
(vAssocSeqLineItems, vAssocSeqOrders, vDMPrep, and vTargetMail), ProspectiveBuyer, and the
DimReseller table. Then we’ll proceed to the next page of the wizard.

This page displays all possible measures from the previously selected measure group tables.
You’ll want to scroll up and down to review the measures. In a production situation, you’d
want to make sure that each selected measure was included as a result of business require-
ments. Consider our earlier discussion about the best practice of keeping the fact tables as
narrow as possible so that the size of the table doesn’t become bloated and adversely affect
cube query and cube processing times. For our sample, however, we’ll just accept the default,
which selected everything, and proceed to the next step of the wizard.

Adding Dimensions
On the next page of the wizard, BIDS lists required dimensions from the source DSV. These
dimensions are related to the measures that you previously selected. If you attempt to clear
the check box for one of the required dimensions, the wizard displays an error that explains
which measure requires the particular dimension that you’ve cleared (so that you can click the
Back button in the wizard and reselect that measure if you want).

The wizard does not allow you to proceed until all required dimensions are selected. As
mentioned previously, hierarchies are not automatically created. We did see an exception
to this in that the product/subcategory/category snowflake dimension was recognized as a
snowflake and created as such. Figure 7-32 shows some of the dimensions that the wizard
suggested.

Figure 7-32 The Cube Wizard suggests dimensions based on the measures you previously selected.

On the last page of the wizard, you can review the metadata that will be created. It’s also
convenient to remember that at any point you can click the Back button to make corrections
or changes to the selections that you made previously. Give your cube a unique name, and
click Finish to complete the wizard. We remind you that you’ve just generated a whole bunch
of XMLA metadata. If you want to review any of the XMLA, right-click on any of the newly
created objects (that is, cubes or dimensions) in Solution Explorer and then click View Code.

You’ll probably want to review the Dimension Usage tab of the cube designer. Do you
remember how to open it? That’s right, just double-click the cube name in Solution Explorer,
and then click the Dimension Usage tab. Take a look at how the Cube Wizard detected
and configured the relationships between the dimensions and the measures. It’s quite accu-
rate. You’ll want to review each configuration, though, particularly when you first begin to
build cubes.

Remember that the AdventureWorksDW2008 relational database source is modeled in a way
that works very well with the Cube Wizard. This is particularly true in terms of relationships.
The reason for this is naming conventions for key columns. The same name is used for the
primary key column and the foreign key column, so the wizard can easily identify the correct
columns to use. In cases where the key column names don’t match, such as between Dim
Date (DateKey) and Fact Internet Sales (OrderDateKey), the wizard won’t automatically find
the relationship. You should follow these design patterns as closely as you can when prepar-
ing your source systems.

You’ll also want to make sure that you understand all the non–star dimension relationships
that were automatically detected. In our sample, we see several referenced (snowflake) and
one fact (table source) relationship, in addition to a couple of other types. You'll have to pay par-
ticular attention to correct detection and configuration of non–star dimensional relationships
because the wizard doesn’t always automatically detect all these types of relationships.

If, for some reason, you need to create an additional dimension, you do that by right-clicking
the dimension folder in Solution Explorer. This launches the New Dimension Wizard. You
should find this tool to be easy to use at this point. You select a source table, which must
have a primary key, confirm any related tables that might be detected (snowflake), and then
select the source columns that will become attributes in the new dimension. To add the new
dimension to an existing cube, simply open the cube in the designer (on either the Cube
Structure or Dimension Usage tab), click Add Cube Dimension on the nested toolbar, and
then select the new dimension name from the dialog box.

You might be anxious to see the results of your work. However, on the Browser tab of the
cube designer, you’ll see an error after you click the Click Here For Detailed Information link.
The error message will say either “The user, <username>, does not have access to Analysis
Services Project x database,” or “The database does not exist.” Do you know which step we
haven’t completed yet and why you’re seeing this error?

You must build and deploy the cube before you can see the source data loaded into the new
metadata structure that you just created. Although you might guess (correctly) that to build
and deploy you could just right-click on the new cube you created in Solution Explorer, you
should hold off on doing that—we’ve got more to tell you about configuring and refining the
cube that you’ve just built. We’ll do that in the next few chapters.

Here are a couple of questions to get you thinking. Take a closer look at the cube and dimen-
sions that you just built. Look closely at the dimension structure. If you open some of the
dimensions that were created using the Cube Wizard, you’ll see that they look different than
the dimensions in the Adventure Works sample cube that we looked at earlier. Something is
missing in our new dimensions. Do you know what it is? The hierarchies haven’t been built.

A good example of this is the Dim Date dimension. If you open this dimension in the dimen-
sion editor, there is only one attribute and no hierarchies, as shown in Figure 7-33. Do you
know the steps to take to create a date dimension in your new cube that is structurally similar
to the Date dimension from the Adventure Works DW 2008 sample?

Figure 7-33 The Dimension Structure tab contains no attribute hierarchies by default.

Did you guess that your first step is to add more attributes to the dimension? If so, that’s
correct! An easy way to do that is to click on the source columns from the DimDate table
and then drag those column names to the Attributes section of the designer. To then create
hierarchies of those attributes, you click on the attribute names in the Attributes section and
drag them to the Hierarchies section. When you’ve done this, the dimension might look like
Figure 7-34. However, notice that we have one of those pesky blue squiggles underneath the
hierarchy name. The pop-up warning text reads, “Attribute relationships do not exist between
one or more levels of this hierarchy. This may result in decreased query performance.” Do you
know how to fix this error?

Figure 7-34 The Dimension Structure tab displays an AMO design warning if no attribute relationships are
detected.

If you guessed that you need to use the Attribute Relationships tab to configure the relation-
ship, you are correct (again!). Interestingly, when you view the Attribute Relationships tab,
you’ll see that the default configuration is to associate every attribute directly with the key
column. To fix this, right-click on the node for Calendar Quarter on the designer surface and
then click New Attribute Relationship. You’ll see a dialog box in which you can configure the
relationship and then specify the type of relationship (flexible or rigid). Which type of rela-
tionship will you choose for these attributes? You’ll likely choose rigid here, as date informa-
tion is usually not changeable.

We have much more to do, but this has been a long chapter. Take a break, fuel up, and con-
tinue on with us in the next chapter to dig even deeper into the world of OLAP cubes.

Summary
We’ve started our tour of BIDS. We’re certainly nowhere near finished yet. In this chapter,
we looked at working with the BIDS SSAS templates. We examined the core objects: data
sources, data source views, roles, and assemblies.

Next we’ll look at the cube and dimension builders. We’ve got lots more to do, so we’ll move
to advanced cube building in the next chapter. Following that, we’ll take a look at building
a data mining structure. Along the way, we’ll also cover basic processing and deployment so
that you can bring your shiny new objects to life on your SSAS server.
Chapter 8
Refining Cubes and Dimensions
Now that you’ve developed your first OLAP cube, it’s time to explore all the other goodies
available in Microsoft Business Intelligence Development Studio (BIDS). You can make your
base cube much more valuable by adding some of these capabilities. Be aware that none of
them are required, but, most often, you’ll choose to use at least some of these powerful fea-
tures because they add value for your end users. There’s a lot of information to cover here.
In this chapter, we’ll look at calculated members, key performance indicators (KPIs), enabling
writeback, and more. We’ll also look at adding objects using the Business Intelligence Wizard.
Finally, we’ll look at advanced cube, measure, and dimension properties.

Refining Your First OLAP Cube


As we get started, we’ll continue working through the OLAP cube designer tabs in BIDS.
To do this, we’ll continue the pattern we started in the previous chapter. That is, we’ll
open two instances of BIDS. In the first instance, we’ll work in connected mode, using the
Adventure Works sample OLAP cube. In the second instance, we’ll work in offline (or discon-
nected) mode. For these advanced objects, we’ll first look at what has been provided in the
Adventure Works sample cube, and then we’ll create these objects using BIDS.

Although we’ll begin to work with the native OLAP query language, MDX, in this chapter,
our approach will be to examine generated MDX rather than native query or expression writ-
ing. The reason for taking this approach is that, as mentioned, MDX is quite complex, and
it’s a best practice to thoroughly exhaust the tools and wizards inside BIDS to generate MDX
before you attempt to write MDX statements from scratch.

Note In Chapters 10 and 11, we cover MDX syntax, semantics, expressions, query authoring,
and more. There we examine specifics and best practices for using the MDX language.

We review these additions in order of simple to complex because we find that, when first
introduced to this material, people absorb it best in this fashion. To that end, we'll start here
with something that should be familiar, because we've already covered it with respect to
dimensions—that is, translations.

Translations and Perspectives


Translations for cube metadata function much like translations for dimension metadata. Of
course, providing localized strings for the metadata is really only a small part of localization.


You must remember that here you’re translating only the object labels (that is, measure
groups, measures, dimensions, and so on).

When your project has localization requirements, they usually also include localizing the
cube's data. The need to localize currency (money) data is an often-used example. Because this
particular translation requirement is
such a common need, Microsoft provides the Currency Conversion Wizard to assist with this
process. This powerful wizard (which is really more of a tool than a wizard) is encapsulated in
another wizard. The metawizard is the Add Business Intelligence Wizard. We’ll review the use
of this tool in general later in the chapter, as well as specifically looking at how the Currency
Conversion Wizard works.

To view, add, or change translations, you use the Translations tab in the OLAP cube designer,
which is shown in Figure 8-1. Translation values can be provided for all cube objects. These
include the cube, its measure groups, and measures, along with dimensions and other
objects, such as actions, KPIs, and calculations.

Figure 8-1 The Translations tab of the OLAP cube designer allows you to provide localized strings for cube
metadata.

Note If a cube doesn’t contain a translation for a user’s particular locale, the default language or
translation view is presented to that user.
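
Which translation a user sees is determined by the locale of the client connection rather than
by the query text. As a hedged sketch, if the connection string includes LocaleIdentifier=3082
(Spanish), a query such as the following returns the Spanish captions you defined, with no
change to the MDX itself:

SELECT
    [Measures].[Internet Sales Amount] ON COLUMNS,
    [Product].[Product Categories].[Category].Members
        DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS
FROM [Adventure Works];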

Perspectives are defined subsets of an OLAP cube. They’re somewhat analogous to relational
views in an RDBMS. However, a perspective is like a view that is defined over the entire cube
but exposes only specific measures and dimensions.

Also, unlike working with relational views, you cannot assign specific permissions to defined
perspectives. Instead, they inherit their security from the underlying cube. We find this object
useful for most of our projects. We use perspectives to provide simplified, task-specific views
(subsets) of an enterprise OLAP cube’s data. Perspectives are easy to create using BIDS. You
simply select the items you want to include in your defined perspective. You can select any of
these types of items: measure groups, measures, dimensions, dimensional hierarchies, dimen-
sional attributes, KPIs, actions, or calculations.

It’s important for you to verify that your selected client tools support viewing of cube data
via defined perspectives. Not all client tools support this option. SSAS presents the perspec-
tive to the client tool as another cube. To use the perspective, instead of selecting the actual
cube (Adventure Works, in the sample), you select the perspective (Direct Sales).
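
From MDX, a perspective is addressed exactly like a cube in the FROM clause. Here is a minimal
sketch, assuming the Direct Sales perspective exposes these objects:

SELECT
    [Measures].[Internet Sales Amount] ON COLUMNS,
    [Date].[Calendar].[Calendar Year].Members ON ROWS
FROM [Direct Sales];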

To easily view, add, or change perspectives, you use the Perspectives tab in the cube
designer. Figure 8-2 shows a partial list of the defined perspectives for the Adventure Works
sample cube.

Figure 8-2 The Perspectives tab of the OLAP cube designer allows you to define subsets of cube data for
particular user audience groups.

Note It’s important to understand that perspectives are not a security mechanism. They are
designed purely as a convenience for you to create simplified views of a particular cube. Security
permissions assigned to the underlying objects—that is, cube and dimensions, and so on—are
enforced when a user browses any defined perspective.

In our experience, translations and perspectives are specialized options. In the real world,
we’ve implemented these two features only when project specifications call for them. Some
of our clients prefer to refrain from using perspectives entirely, while others quite like them.
Translations are used when localization is part of the project. The next feature we cover, how-
ever, is one that nearly 100 percent of our clients have used.

Key Performance Indicators


KPIs are core metrics and measurements related to the most important business analytics.
In our experience, we’ve often heard them referred to as “the one (unified) view of the truth
for a business.” KPIs are often displayed to end users via graphics on summary pages, such as
dashboards or scorecards. Although you set up KPIs to return results as numbers—that is, 1 is
good, 0 is OK, and –1 is bad—you generally display these results as graphics (such as a traffic-
light graphic with red, yellow, or green selected, or as different colors or types of arrows). The
returned number values aren’t as compelling and immediate as the graphics to most users.
The KPIs tab in the OLAP cube designer in BIDS has the built-in capacity to display several
types of graphics instead of numbers. The KPIs tab includes both a design area and a pre-
view (or browse) area. An important consideration when including KPIs in your OLAP cube is
whether or not your selected end-user client applications support the display of KPIs. Both
Microsoft Office Excel 2007 and Microsoft Office SharePoint Portal Server 2007 support the
display of SSAS OLAP cube KPIs. The reason we make this distinction is that both Excel and
SharePoint Portal Server support display of KPIs from multiple sources—OLAP cubes, or Excel
or SharePoint Portal Server.

Tip We recommend that if your business requirements include the use of KPIs in your solution,
you create as many of these KPIs as possible in the OLAP cube (rather than in Excel, SharePoint Portal Server,
and so on). This development approach better maintains that uniform view of the truth that is
important for businesses.

Open the sample Adventure Works cube in BIDS, and click on the KPIs tab. We’ll first use
the KPI browser to get an idea of what the sample KPIs are measuring and how they might
appear in a client dashboard. The default view of KPIs is the design view. To open the browser
view, click the tenth button from the left (the KPI icon with magnifying glass on it) on the
embedded toolbar, as shown in Figure 8-3.

Figure 8-3 The BIDS KPI designer includes a KPI viewer.

KPIs consist of four definitions—value, goal, status, and trend—for each core metric. These
metrics are defined for a particular measure. Recall that each measure is associated with one
particular measure group. The information for these four items is defined using MDX. At
this point, we’ll simply examine generated MDX. (In future chapters, we’ll write MDX from
scratch.)

Following is a list of definitions and example statements for the most frequently used items
in KPIs:

■■ Value MDX statement that returns the actual value of the metric. For a KPI called
Revenue, this is defined as [Measures].[Sales Amount].
■■ Goal MDX statement that returns the target value of the metric. For a KPI called
Revenue, this is defined using an MDX Case…When…Then expression.
■■ Status MDX statement that returns Value – Goal as 1 (good), 0 (OK), or –1 (bad).
Again, this is defined using an MDX Case…When…Then expression.
■■ Trend MDX statement that returns Value – Goal over time as 1, 0, or –1. Similar to
both the Goal and Status values, this also uses an MDX Case expression to define its
value. (This statement is optional.)
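
Client tools (and your own queries) can retrieve all four parts of a server-side KPI with the
built-in KPI functions. The following sketch assumes the sample's Revenue KPI and Fiscal
hierarchy names:

SELECT
    {
        KPIValue("Revenue"), KPIGoal("Revenue"),
        KPIStatus("Revenue"), KPITrend("Revenue")
    } ON COLUMNS,
    [Date].[Fiscal].[Fiscal Year].Members ON ROWS
FROM [Adventure Works];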

Figure 8-4 shows the KPIs defined in the Adventure Works sample OLAP cube. Note that in
addition to showing the actual amounts for the Value and Goal fields, the browser shows a
graphical icon to represent state for the Status and Trend fields. Also, this browser allows you
to set filters to further test your KPIs. In the example in Figure 8-4, we’ve added a filter to
show results for customers who live in Australia.

You can also nest KPIs—that is, create parent and child KPIs. If you do this, you can assign a
Weight value to the nested KPI. This value shows the child’s percentage of the total contribu-
tion to the parent KPI value. An example from this sample is the Operating Profit KPI, which
rolls up to be part of the Net Income KPI.

In addition to the sample KPIs that are part of the Adventure Works cube, BIDS includes
many templates for the more common types of KPIs. To see these, click the Form View but-
ton on the embedded toolbar to switch back to design view. Note that, as with other objects,
this designer includes both a metadata viewer and a designer surface. If you take a closer
look at the metadata browser, you can see that it lists the existing KPIs and includes three
sections on the bottom: Metadata, Functions, and Templates. The templates are specific to
KPIs. We’ve opened the metadata browser (shown in Figure 8-5) to show the MDX functions
that are specific to KPI-building tasks.

Figure 8-4 The BIDS KPI browser allows you to see the KPIs that you design.

Figure 8-5 The BIDS KPI designer includes a metadata browser.

First, look at the included KPI samples and templates. You’ll notice that they’re easy to cus-
tomize for your particular environment because they both include a sufficient number of
comments and are set up as MDX templates. What we mean by this is that they have double
chevrons (<< some value >>) to indicate replaceable parameter values. Keep in mind, though,
that because of the naming idiosyncrasies of MDX, the more effective way to place object
names into any MDX statement is by dragging and dropping the particular object from the
metadata browser. As you work with the design area, you’ll probably be happy to discover

that it color-codes the MDX—for example, all built-in functions are brown, comment code is
green, and so on. Also, the designer includes basic IntelliSense for MDX functions. In Figure
8-6, you can see the template for the Net Income KPI. To add this KPI to the cube, you can
double-click on the Net Income template under the Financial folder in Templates.

Figure 8-6 The BIDS KPI designer includes an MDX syntax color-coding function and IntelliSense, and it
supports commenting.

Follow these steps to customize the KPI in our example:

1. Give the KPI a unique name.


2. Select a measure group to associate the KPI with.
3. Define a Value expression in MDX. (A template is provided for us.)
4. Define a Goal expression in MDX. Commented suggestions are provided.
5. Define a Status expression in MDX. Patterned sample MDX code is provided.
6. Optionally, define a Trend expression in MDX. As with the Status expression, patterned
sample MDX code is provided.

The Value, Goal, and Status expressions are self-explanatory. The templated code included for
the Trend value needs more explanation, and it’s shown in the following code sample. This
sample uses the MDX function ParallelPeriod to get a value from the same area of the time
hierarchy for a different time period, such as “this fiscal week last year,” to support the trend.

The ParallelPeriod function works as follows: The function returns a value from a prior period
in the same relative position in the time dimensional hierarchy as the specified value. The
three arguments are a dimension level argument (for example, DimTime.CalendarTime.Years),
a numeric value to say how many periods back the parallel period is, and a specific value or
member (for example, DimTime.CalendarTime.CurrentMember).
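
For example, a concrete call (a sketch that assumes the sample's [Date].[Calendar] hierarchy)
that returns the member occupying the same relative position one year back looks like this:

ParallelPeriod
(
    [Date].[Calendar].[Calendar Year],  -- level at which to step back
    1,                                  -- number of periods back (here, one year)
    [Date].[Calendar].CurrentMember     -- member to evaluate relative to
)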

ParallelPeriod is one of hundreds of powerful, built-in functions. CurrentMember is another


built-in function used for the calculation of the trend value. It’s simpler to understand. As
you’d probably guess, it retrieves the currently selected member value.

/* The periodicity of this trend comparison is defined by the level at which the
   ParallelPeriod is evaluated. */
IIf
(
    KPIValue( "Net Income" ) >
    (
        KPIValue( "Net Income" ),
        ParallelPeriod
        (
            [<<Time Dimension Name>>].[<<Time Hierarchy Name>>].[<<Time Year Level Name>>],
            1,
            [<<Time Dimension Name>>].[<<Time Hierarchy Name>>].CurrentMember
        )
    ),
    1, -1
)

As we start to examine KPIs, you are getting a taste of the MDX query language. At this
point, we prefer to introduce you to KPIs conceptually. In the chapters dedicated to MDX
(Chapters 10 and 11), we’ll provide you with several complete KPI examples with the com-
plete MDX syntax and a fuller explanation of that syntax.

One last point is important when you’re adding KPIs. When designing your BI solution KPIs,
you need to know whether to choose server-based KPIs, client-based KPIs, or a combina-
tion of both. If your user tools support the display of Analysis Services KPIs, server-based
KPIs are most commonly used because they are created once (in the cube) and reused by
all users. Server-based KPIs present a uniform view of business data and metrics—a view
that is most often preferred to using client-based KPIs. If your requirements call for exten-
sive dashboards that include KPIs, you might want to investigate products such as Microsoft
PerformancePoint Server, which are designed to facilitate quick-and-easy creation of such
dashboards. We cover client tools in general in Part IV.

It’s also often a requirement to provide not only summarized KPIs on a dashboard for end
users, but also to enable drillthrough back to detailed data behind the rollup. In the next sec-
tion, you’ll see how to enable drillthrough (and other types of) actions for an OLAP cube.

Note New to Microsoft SQL Server 2008 is the ability to programmatically create KPIs. We cover
this in detail in Chapter 11.

Actions
Actions are custom-defined activities that can be added to a cube and are often presented
as options when an end user right-clicks in an area of a cube browser. A critical consideration
is whether the client applications that you’ve selected for your BI solution support action
invocations. Of course, if you’re creating your own client, such as by implementing a custom
Windows Forms or Web Forms application, you can implement actions in that client in any
way you choose. Common implementations we’ve seen include custom menus, custom short-
cut menus, or both.

The built-in cube browser in BIDS does support actions. This is convenient for testing pur-
poses. To view, add, or change actions, you use the Actions tab in the OLAP cube designer.
There are three possible types of actions: regular, drillthrough, and reporting. Actions are
added to SSAS cubes and targeted at a particular section—that is, an entire dimension, a
particular hierarchy, a particular measure group, and so on. Here is a brief explanation of the
three action types:

■■ Regular action This type of action enables end users to right-click on either cube
metadata or data and to perform a subsequent activity by clicking a shortcut menu
command. This command passes the data value of the cell clicked (as a parameter) to
one of the following action types: Dataset, Proprietary, Rowset, Statement, or URL.
What is returned to the end user depends on which action type has been set up in
BIDS. For example, if a URL action has been configured, a URL is returned to the client,
which the client can then use to open a Web page. Rowset actions return rowsets to the
client and Dataset actions return datasets. Statement actions allow the execution of an
OLE DB statement.
■■ Reporting action This type of action allows end users to right-click on a cell in the
cube browser and pass the value of the location clicked as a parameter to SQL Server
Reporting Services (SSRS). Activating the action causes SSRS to start, using the custom
URL; this URL includes the cell value and other properties (for example, the format of
the report, either HTML, Excel, or PDF). An interesting property is the invocation type.
Options for this property are interactive (or user-initiated), batch (or by a batch process-
ing command), or on open (or application-initiated). This property is available for any
action type and is a suggestion to the client application on how the action should be
handled.
■■ Drillthrough action This type of action enables end users to see some of the detailed
source information behind the value of a particular measure—for example, for this dis-
count amount, what are the values for the x, y, and z dimensional attributes. As mentioned, this
type of action is used frequently in conjunction with KPIs.
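
Under the covers, a drillthrough request is expressed with the MDX DRILLTHROUGH statement.
The following is only a sketch (the member key is assumed, and drillthrough must be permitted
on the cube); it returns a handful of the fact rows behind a single cell:

DRILLTHROUGH MAXROWS 10
SELECT
    { [Measures].[Internet Sales Amount] } ON COLUMNS
FROM [Adventure Works]
WHERE ( [Date].[Calendar].[Calendar Year].&[2004] );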

As with the KPI designer, the actions designer in BIDS includes an intelligent metadata
browser. This contains a list of all actions created for the cube, cube metadata, functions,
and action templates. The configuration required for the various action types depends
on which type was selected. For example, for regular actions, you must give your action a
unique name, set the target type, select the target object, set the action type, and provide an
expression for the action. The Target Type option requires a bit more explanation. This is the
type of object that the end user clicks to invoke the actions. You can select from the follow-
ing list when configuring the Target Type option: Attribute Members, Cells, Cube, Dimension
Members, Hierarchy, Hierarchy Members, Level, or Level Members.

In Figure 8-7, the City Map action contains information for the Additional Properties options
of Invocation (which allows you to select Batch, Interactive, or On Open from the drop-down
list), Description (which provides a text box into which you can type in a description), and
Caption (which has a text box for you to type a caption into). Note that the caption information is written in MDX in this example and the last optional parameter, Caption Is MDX, has been set to True.

Figure 8-7 When configuring regular actions in BIDS, you must specify the target type, object, condition, and action content.

You can also optionally supply one or more conditions that must be met before the action is invoked. For our particular sample action, one example of a condition would be an MDX expression specifying that target values (in this case, cities) must originate from countries in North America only.

Of course, Figure 8-7 does not display some key information—specifically, that of the Action
Content section. The Action Content section provides a Type drop-down list, from which you
can select Dataset, Proprietary, Rowset, Statement, or URL. As mentioned, the Action Content
parameter specifies the return type of the action result. In our sample, shown in Figure 8-8,
the action result is set to URL.

So, for our sample, end users must right-click on a member associated with the Geography
dimension and City attribute to invoke this particular action. As the Description text box in
Figure 8-7 states, the result will display a map for the selected city.

If you look at the action expression in Figure 8-8, you can see that it constructs a URL using
a concatenated string that is built by using the MDX function CurrentMember.Name and the
conditional Case…When…Then expression.
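
As a concrete illustration, an expression of that general shape might look like the following minimal sketch. It is not the exact expression shipped with the Adventure Works sample; the URL and the city handling are placeholders only:

CASE
    // Handle a city that needs extra qualification in the URL.
    WHEN [Geography].[City].CurrentMember.Name = "Paris"
        THEN "http://maps.live.com/default.aspx?where1=Paris, France"
    ELSE "http://maps.live.com/default.aspx?where1=" +
         [Geography].[City].CurrentMember.Name
END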

Figure 8-8 The Action Content section for a regular cube action contains an MDX statement to return a result of the type specified in the Type parameter.

The next type of action is a reporting action. This action functions similarly to a regular action
with a URL target. A particular area of the cube is clicked and then metadata is passed as a
parameter to an SSRS report URL. In the Adventure Works sample, the reporting action is
named Sales Reason Comparisons. Figure 8-9 shows the configuration parameters. Many of
these parameters—such as Target Type, Target Object, and Condition—are identical to those
for regular actions.

The set of parameters starting with the Report Server section is unique to this action type. As you can see in Figure 8-9, you need to specify the following SSRS-related options: Server Name, Report Path, and Report Format. The Report Format drop-down list includes the following choices: HTML5, HTML3, Excel, and PDF. You can optionally supply additional parameters in the ProductCategory row. Finally, in the Additional Properties section, you have options to specify the invocation type, a description, and other information, just as you did for regular actions with URL targets.
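
The optional parameter values are themselves MDX expressions. A minimal, hypothetical sketch of a parameter expression that passes the clicked product category name to the report, URL-encoded, might look like this (the expression in the actual sample may differ):

// Hypothetical parameter value expression for the ProductCategory parameter.
UrlEscapeFragment( [Product].[Category].CurrentMember.Name )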

Figure 8-9 A report cube action associates cube metadata with an SSRS report URL.

Another consideration if you’re planning to use reporting actions is determining the type of credentials and number of hops that will have to be navigated between the physical servers on which you’ve installed your production SQL Server Analysis Services (SSAS) and SSRS instances. We cover the topic of SSRS security configuration (and integration with SSAS in general) in greater detail in Part IV. At this point, you should at least consider what authentication type (for example, Windows, Forms, or custom) users will use to connect to Reporting Services. When you implement custom actions, in addition to selecting and configuring authentication type, you’ll usually also create one or more levels of authorization (such as roles or groups). We typically see custom actions limited to a subset of the BI project’s end users (such as the analyst community).

The last type of action available on the Actions tab is the drillthrough action. To understand
what this action enables, look at the sample cube in the BIDS cube browser. Figure 8-10
shows a view of the cube that includes the results of right-clicking on a data cell (that is, a
measure value) and clicking Drillthrough. Drillthrough refers to an ability to make source data
that is included in the cube available (at the level of a particular column or group of columns
from the DSV). Drillthrough does not mean enabling end users to drill back to the source
systems. Of course, if you plan to use drillthrough, you must also verify that all end-user
tools support its invocation. Excel 2007 supports drillthrough. SSRS does not natively support drillthrough actions; you can set up an SSRS report to look like it supports drillthrough, but you have to build this functionality (that is, the links between reports) specifically for each report.

Note It’s easy to confuse drillthrough with drill down when designing client applications.
Drillthrough is a capability of OLAP cubes that can be enabled in BIDS. Drillthrough allows end
users to see more detail about the particular selected item. Drill down is similar but not the same.
Drill down refers to a more general client reporting capability that allows summary-level detail to
be expandable (or linkable) to more detailed line-by-line information.
An example of drill down is a sales amount summary value for some top-level periods, such as
for an entire year, for an entire division, and so on. Drill down allows end users to look at row-
by-row, detailed sales amounts for days, months, individual divisions, and so on. Drillthrough, in
contrast, enables end users to see additional attributes—such as net sales and net overhead—
that contributed to the total sales number.

Figure 8-10 A drillthrough cube action allows clients to return the supporting detail data for a given cell.

Another consideration when you’re configuring the drillthrough action is whether to enable only the columns that the business requirements call for. Including more columns means more query processing overhead. We’ve seen situations where excessive use of drillthrough
resulted in poor performance in production cubes. There are a few other oddities related to
drillthrough action configurations.

Unlike the process for creating regular actions and reporting actions, when you’re creating drillthrough actions you need to specify a target measure group (rather than object and level). Figure 8-11 shows an example of the syntax used to create a drillthrough action targeted at the Reseller Sales measure group.

Note that you can optionally configure the maximum number of rows returned via the
drillthrough query. For the reasons just mentioned, we recommend that you configure this
value based on requirements and load testing.
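
Behind the scenes, a client retrieves this detail by issuing an MDX DRILLTHROUGH statement against the cube. The following is a minimal sketch using Adventure Works names; the columns actually returned depend on the drillthrough configuration, and the MAXROWS clause here plays the same row-limiting role as the maximum rows setting just mentioned:

DRILLTHROUGH MAXROWS 100
SELECT ([Measures].[Reseller Sales Amount]) ON COLUMNS
FROM [Adventure Works]
WHERE ([Date].[Calendar].[Calendar Year].&[2004])
// An optional RETURN clause can list the specific columns to bring back.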

Figure 8-11 Drillthrough actions are targeted at particular measure groups.

Note Sometimes our clients are confused about two features: drillthrough and writeback. Just
to clarify, drillthrough is the ability to view selected source data behind a particular measure
value. Writeback is the ability to view and change source data. Writeback requires significantly
more configuration and is discussed later in this chapter.

A final consideration when implementing drillthrough actions is that they require specific permissions to be set for the roles that you want to be able to use drillthrough. These permissions are set at the level of the cube. Figure 8-12 shows the role designer in BIDS, open to the Cubes tab. To set up drillthrough permissions for a particular role’s members, simply enable drillthrough via this tab.

Figure 8-12 Drillthrough actions require role permission for drillthrough at the level of the cube.

As we did with KPIs, we’ll now review the item we most often add to a customer’s OLAP cubes based on needs that surfaced during the business requirements gathering phases of the project. This object is quite complex and comes in several types. In BIDS, it’s called a calculation.

Calculations (MDX Scripts or Calculated Members)


The next area of the BIDS cube designer that you can use to enhance your OLAP cube is the Calculations tab. This tab allows you to add three types of objects to your cube: calculated
members, named sets, and script commands. These objects are all defined using MDX.

We’ll take a light tour of this area in this section, with the intent being for you to get comfortable reading the included calculations in the Adventure Works sample. As mentioned, in a couple of subsequent chapters, we’ll examine the science of authoring native MDX expressions and queries. We understand that you might already be familiar with MDX and eager to code natively; however, we caution you that MDX is deceptively simple. Please heed our advice: learn to read it first, and then work on learning to write it.

Note We’ve had many years of experience with this tricky language and still don’t consider ourselves to be experts. Why is MDX so difficult? The reason is not so much the language itself. MDX’s structure is (loosely) based on Transact-SQL, a language many of you are quite familiar with. Rather, the difficulty lies in the complexity of the OLAP store that you’re querying with MDX. The appropriate subset of an n-dimensional structure is nearly impossible to visualize, which makes writing accurate queries and expressions very challenging.

To get started, we’ll open the Calculations tab for the Adventure Works sample in BIDS. The
default view is shown in Figure 8-13. Notice that the tab is implemented in a similar way to
others in BIDS (KPIs, Actions, and so on) in that there is a tree view on the upper left side
that lists all objects. This is the Script Organizer section. There you can see the three types of
objects, with cube icons indicating top-level queries, sheet-of-paper icons indicating script
commands, and blue calculator icons indicating calculated members. Adventure Works also
contains named sets; these are indicated with black curly braces.

Figure 8-13 The Calculations tab in BIDS provides you with a guided calculated member designer.

Below the Script Organizer section, you’ll find the Calculation Tools section, which includes
the Metadata, Functions, and Template tabs. Functions and templates will be filtered to
reflect the object types on this tab—that is, calculated members, and so on. The default
view of the right side of the designer displays the object in a guided interface. In Figure 8-13,
we’ve selected a calculated member, called [Internet Gross Profit Margin].

You can see from Figure 8-13 that you must complete the following information to define a
calculated member:

■■ Name Must be unique. If the name contains embedded spaces, it must be surrounded by square brackets ([ ]).
■■ Parent Properties This section includes options for specifying the names of the hierarchy and member where the new member will be defined. Most often this will be
the Measures hierarchy. The Measures hierarchy is flat (that is, it has only one level),
so if you’re defining your calculated member here (and most often you will be), there
is no parent member because there is only one level in this dimensional hierarchy. In
Chapters 10 and 11, which cover MDX in detail, we explain situations when you might
want to define calculated members in hierarchies other than measures.
■■ Expression This is the MDX expression that defines how the value will be calculated. In the simple example shown in Figure 8-13, the MDX expression is very straightforward—that is, (Sales – Cost) / Sales = Gross Profit. In the real world, these calculations are often much more complex. Fortunately, MDX includes a rich function library to make these complex expressions easier to write. At this point, you can think of this statement as somewhat analogous to a cell formula in an Excel workbook. This is an oversimplification, but it will get you started. (A sketch of a complete calculated member definition appears after this list.)
■■ Additional Properties This section is optional. The options in this section include specifying the format type, visibility (you can build calculated members from other calculated members and then hide the intermediate measures to reduce complexity or to meet security requirements), and more. The Color Expressions and Font Expressions sections function similarly to conditional formatting in Excel, except that you write the expressions in MDX.
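
In the script view (described shortly), these settings correspond to a CREATE MEMBER statement. The following is a simplified sketch of a definition like the one in Figure 8-13; it is illustrative only and not necessarily the exact script included in the Adventure Works sample:

// (Sales – Cost) / Sales, formatted as a percentage.
CREATE MEMBER CURRENTCUBE.[Measures].[Internet Gross Profit Margin]
    AS ([Measures].[Internet Sales Amount] - [Measures].[Internet Total Product Cost])
       / [Measures].[Internet Sales Amount],
    FORMAT_STRING = "Percent",
    VISIBLE = 1;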

You might be surprised to see that the Adventure Works sample cube includes more than 20
calculated members, as well as a number of named sets and script commands. It’s important
that you understand exactly how these objects are created when you’re deciding whether to
include them or not. Table 8-1 summarizes this information.

Table 8-1 Types of Calculated Objects for OLAP Cubes

Calculated members
Advantages: Calculated at query time; results are not stored; they do not add to cube processing time or disk storage. The SSAS engine is optimized to execute them; cubes can support hundreds of calculated members.
Disadvantages: Although this approach is fast, they’re not as fast to query as data that is stored (that is, part of a fact table). They must be defined by writing MDX expressions, and you must monitor member intersections if combining members from a non-measures hierarchy.

Named sets
Advantages: Can make other MDX queries that refer to named sets more readable. Easy to understand, similar to relational views.
Disadvantages: Must be defined by writing MDX queries; the query syntax must retrieve the correct subset and do so efficiently. Must be retrieved in the correct order if multiple named sets are used as a basis for other MDX queries.

Script commands
Advantages: Very powerful, completely customizable, and can limit execution scope very selectively.
Disadvantages: Very easy to make errors when using them. Complex to write, debug, and maintain. If not written efficiently, they can cause performance issues.

Before we leave our (very brief) introduction to calculations in BIDS, we’ll ask you to take
another look at the Calculations tab. Click the twelfth button from the left on the embedded toolbar, named Script View. Clicking this button changes the guided designer to a single,
all-up MDX script view that contains the MDX statements that define each object (whether
it is a named set, calculated member, or other type of object) in a single file. Be sure to keep
in mind that when you’re adding MDX calculated objects to your OLAP cube, the order in
which you add them is the order in which the script creates them. This is because you could
potentially be creating objects in multiple dimensions; if the objects are created in an order
other than what you intended, inappropriate overwriting could occur.

In fact, by reading the MDX and the included comments, you’ll see several places in the
Adventure Works sample where particular objects are deliberately positioned at a particular
point in the execution order. In Figure 8-14, this information is thoroughly commented. As
you complete your initial review of the included MDX calculations, you’ll want to remember
this important point and be aware that in addition to the physical order of code objects, you
can also use MDX keywords to control the overwrite behavior. We cover this topic in more
detail in Chapters 10 and 11.

Figure 8-14 The Calculations tab in BIDS provides a complete view of all scripted objects in the order they are executed.

As you begin to explore the MDX code window, you might see that there is a syntax validate
button on the embedded toolbar. Oddly, it uses the classic spelling-checker icon—ABC plus a
check mark. This seems rather inappropriate given the complexity of MDX! Also, if you have
development experience coding with other languages using Visual Studio, you might wonder
whether you’ll find any of the usual assistance in the integrated development environment
when you begin to write MDX code manually. Fortunately, you’ll find several familiar fea-
tures. For example, if you click in the left margin in the view shown in Figure 8-14, you can
set breakpoints in your MDX calculations code. We’ve found this to be a very valuable tool
when attempting to find and fix errors. We’ll explore this and other coding aspects of MDX in
future chapters.

Note Chapters 10 and 11 are devoted to a deeper look at MDX syntax. There we’ll look at both
MDX expression and query syntax so that you can better understand, edit, or write MDX statements natively if your business needs call for that.

New in SQL Server 2008 are dynamic named sets; the set values are evaluated at each call to
the set. This behavior differs from that of named sets in SQL Server 2005, where named sets
were evaluated only once, at the time of set creation. The new flexibility introduced in SQL
Server 2008 makes named sets more usable. We’ll take a close look at dynamic named set
syntax in Chapter 11.
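
To illustrate the syntax difference now, here is a minimal sketch (using Adventure Works names, though not copied from the sample) of a conventional named set and its dynamic counterpart; only the DYNAMIC keyword differs:

// Static named set: evaluated once, when the set is created.
CREATE SET CURRENTCUBE.[Top 10 Resellers] AS
    TopCount([Reseller].[Reseller].[Reseller].Members, 10,
             [Measures].[Reseller Sales Amount]);

// Dynamic named set (new in SQL Server 2008): re-evaluated in the
// context of each query that references it.
CREATE DYNAMIC SET CURRENTCUBE.[Top 10 Resellers (Dynamic)] AS
    TopCount([Reseller].[Reseller].[Reseller].Members, 10,
             [Measures].[Reseller Sales Amount]);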

Using Cube and Dimension Properties


Generally, there are a couple of ways to set advanced configurable SSAS object properties
(that is, cube and dimension properties) using BIDS. One way is via the standard pattern
used in Visual Studio: Open the object of interest in Solution Explorer, and then open the
Properties window by pressing F4 or right-clicking on the appropriate object and choosing
Properties. The Properties window displays (or can be set to display) the viewable and configurable object-specific properties.

Tip If you right-click on an object in Solution Explorer and choose Properties, you’ll see only the
container’s properties (the file or database object that the actual code is stored in). To see the
actual object properties, you must open its designer in BIDS and select the appropriate object in
the designer to see all the configurable properties.

Before we drill too deeply into advanced property configurations, we’ll explore yet another
way to configure object properties: by using the Business Intelligence Wizard. This wizard is
accessed by clicking what is usually the first button on the embedded toolbar on any SSAS
object tab. This is shown in Figure 8-15. This option is available in both the cube and dimension designers.

Figure 8-15 The Business Intelligence Wizard is available on the embedded toolbars in BIDS.

After you open the Business Intelligence Wizard, you’ll see on the first page that there are eight possible enhancements. Following is a list of the available options, which are
also shown in Figure 8-16:

■■ Define Time Intelligence
■■ Define Account Intelligence
■■ Define Dimension Intelligence
■■ Specify A Unary Operator
■■ Create A Custom Member Formula
■■ Specify Attribute Ordering
■■ Define Semiadditive Behavior
■■ Define Currency Conversion

Figure 8-16 The Business Intelligence Wizard in the BIDS cube designer allows you to make advanced property configuration changes easily.

What types of changes do these enhancements make to cubes and dimensions? They fall into two categories: advanced property configurations and MDX calculations, or some combination of both types of enhancements. We’ll start our tour with advanced property configurations because the most common scenarios have been encapsulated for your convenience in this wizard.

We find this wizard to be a smart and useful tool; it has helped us to be more productive in a
number of ways. One feature we particularly like is that the suggestions the wizard presents
on the final page are in the format in which they’ll be applied—that is, MDX script, property
configuration changes, and so on. On the final page, you can either confirm and apply the
changes by clicking Finish, cancel your changes by clicking Cancel, or make changes to the
enhancements by clicking Back and reconfiguring the pages for the particular enhancement
that you’re working with.

Note If you invoke the Business Intelligence Wizard from inside a Dimension Editor window, it
displays a subset of options (rather than what is shown in Figure 8-16). The wizard displays only
enhancements that are applicable to dimension objects.

Time Intelligence
Using the Define Time Intelligence option allows you to select a couple of values and then
have the wizard generate the MDX script to create a calculated member to add the new
time view. The values you must select are the target time hierarchy, the MDX time function,
and the targeted measure or measures. In our example, we’ve selected Date/Fiscal, Year To
Date, and Internet Sales Amount, respectively, for these values. In Figure 8-17, you can see
one possible result when using the Adventure Works sample. After you review and accept
the changes by clicking Finish, this MDX script is added to the Calculations tab of the cube
designer. There you can further (manually) edit the MDX script if desired.

Figure 8-17 The Business Intelligence Wizard in the BIDS cube designer allows you to add calculated members based on generated MDX.

We frequently use this wizard to generate basic calculated members that add custom time
views based on business requirements. We also have used this wizard with developers who
are new to MDX as a tool for teaching them time-based MDX functions.
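
The exact script that the wizard emits depends on your selections, and it also adds a new attribute to the Date dimension to hold the time views. The following is a simplified sketch of the general pattern rather than the wizard’s literal output; the attribute name [Fiscal Date Calculations] and the member name [Year to Date] are stand-ins:

// Placeholder member on a wizard-generated Date attribute.
CREATE MEMBER CURRENTCUBE.[Date].[Fiscal Date Calculations].[Year to Date] AS "NA";

// For the targeted measure, replace the placeholder with a fiscal
// year-to-date aggregate. Crossing with the default member avoids a
// self-referencing calculation.
SCOPE ({ [Measures].[Internet Sales Amount] });
    ([Date].[Fiscal Date Calculations].[Year to Date]) =
        Aggregate(
            { [Date].[Fiscal Date Calculations].DefaultMember } *
            PeriodsToDate([Date].[Fiscal].[Fiscal Year],
                          [Date].[Fiscal].CurrentMember));
END SCOPE;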

SCOPE Keyword
Notice in the preceding script that the calculated member definition in MDX includes a
SCOPE…END SCOPE keyword wrapper. Why is that? What does the SCOPE keyword do? It
defines a subset of a cube to which a calculation is applied. This subset is called a subcube.
Three MDX terms are used to manage scope: CALCULATE, SCOPE, and THIS. The simplest way
to understand how these keywords work is to examine the example presented in SQL Server
Books Online:

/* Calculate the entire cube first. */
CALCULATE;
/* This SCOPE statement defines the current subcube */
SCOPE([Customer].&[Redmond].MEMBERS,
      [Measures].[Amount], *);
/* This expression sets the value of the Amount measure */
THIS = [Measures].[Amount] * 1.1;
END SCOPE;

As you can see from this code snippet, the CALCULATE statement causes the cells of the entire cube to be calculated (aggregated from the leaf-level data) first. The SCOPE statement defines a subset (or subcube)
of the entire cube that calculations will be performed against. Finally, the THIS expression
applies a calculation to the defined subcube.

Note For more information, see the SQL Server Books Online topic “Managing Scope and
Context (MDX)” at http://msdn.microsoft.com/en-us/library/ms145515.aspx.

Account Intelligence and Unary Operator Definitions


Using the Define Account Intelligence option (which is available only in the Enterprise edition of SSAS) allows you to select a couple of values and have the wizard correctly configure
advanced cube or dimension properties so that standard accounting attributes, such as Chart
Of Accounts or Account Type, can be associated with dimension members. To use this option,
you must first select the dimension to which the new attributes will be applied. For our
example, we’ll select the Account dimension from Adventure Works. On the next page of the
wizard, you need to select source attributes in the selected dimension that will be assigned
to standard account values. Also, note that these attributes are set to be semiadditive by
default. This means that the default aggregation (sum) will be overridden to reflect standard
aggregation based on account type—for example, profit = sales – costs. These options are
shown in Figure 8-18.

Figure 8-18 The Business Intelligence Wizard in the BIDS cube designer allows you to map advanced properties to dimension attributes.

For the wizard to correctly interpret the semiadditive aggregation behavior, you also have
to configure source attributes to built-in account types, such as asset, balance, and so on.
You do this on the next page of the wizard. The mapping is generated automatically by the
wizard and can be altered if necessary. The last step in the wizard is to review the suggested
results and confirm by clicking Finish. This is shown in Figure 8-19.

As you can see by reviewing the results of the wizard, what you’ve done is configure two advanced dimension member properties for all selected attributes of the targeted dimension, which is Account in our sample. The two properties are AggregationFunction and Alias. AggregationFunction applies a particular MDX function that overrides the default Sum aggregation for Account dimension member attributes based on a standard naming strategy—for example, the LastNonEmpty function for the Flow member. Note also that the AggregationFunction property value for the Amount measure of the Financial Reporting measure group has been set to the value ByAccount.

The Account Intelligence feature is specialized for scenarios in which your business requirements include modeling charts of accounts. This is, of course, a rather common scenario. So if your business intelligence (BI) project includes this requirement, be sure to take advantage of the wizard’s quick mapping capabilities. You can further modify any of the configured properties by opening the dimension and attribute property sheets, locating the properties (such as AggregationFunction), and updating the property values to the desired configuration value from the property sheet itself.

Figure 8-19 The Business Intelligence Wizard in the BIDS cube designer allows you to define custom aggregation behavior.

The Specify A Unary Operator feature is used in situations similar to those for the Account Intelligence feature—that is, where your business requirements include modeling charts of accounts. We’ve used the unary operator feature when the modeled source data is financial but arranged in a more unstructured or nonstandard type of hierarchy. Another way to understand this difference is to keep in mind that unary operators are used when you have source attributes that don’t map to the standard types listed in the Business Intelligence Wizard. You can see the Specify A Unary Operator page of the Business Intelligence Wizard in Figure 8-20.

We’ll use the Account dimension from Adventure Works to demonstrate the use of the
Specify A Unary Operator feature. Note that on the second page of the wizard the source
data must be modeled in a particular way to use this feature. In Figure 8-21, you’ll see that
the source dimension must include a key attribute, parent attribute, and column that contains
the unary operator (which will define the aggregation behavior for the associated attribute).
When the wizard completes, the property value UnaryOperatorColumn is set to the column
(attribute) value that you specified previously in the wizard.

Figure 8-20 The Business Intelligence Wizard in the BIDS cube designer allows you to define unary operators to override default aggregation behavior.

Figure 8-21 Using the Specify A Unary Operator enhancement requires a particular structure for the source data.

Note Usual unary operator values are the following: +, –, or ~. The final value listed, the tilde,
indicates to SSAS that the value should be ignored in the aggregation. Taking a look at the
source data used in creating the Account dimension can help you to understand the use of a
unary operator in defining a custom hierarchy and rollup for accounting scenarios. You can see in
Figure 8-21 that the first value, Balance Sheet, should be ignored. The next 22 values (from Assets
to Other Assets) should be summed. Line 25 (Liabilities And Owners Equity) should be subtracted.
You can also see by looking at the sample data that the hierarchy in the data is defined by the
value provided in the ParentAccount column.

Other Wizard Options


Using the Define Semiadditive Behavior option allows you to define the aggregation method
that you need to satisfy the project’s business requirements using the wizard. This option
is even more flexible than the Account Intelligence or Unary Operator options that we just
looked at. Using this option, you can simply override the automatically detected aggregation
type for a particular attribute value in any selected hierarchy. BIDS applies the types during
cube creation, using the attribute value names as a guide. Any changes that you make in the wizard update the attribute property named AggregationFunction. You can select any one value from the drop-down list. These values are as follows: AverageOfChildren, ByAccount,
Count, DistinctCount, FirstChild, FirstNonEmpty, LastChild, LastNonEmpty, Max, Min, None,
and Sum. We remind you that all of these wizard-generated enhancements are available only
in the Enterprise edition of SSAS.

Using the Define Dimension Intelligence option allows you to map standard business types
to dimension attributes. You use this option when your end-user applications are capable
of adding more functionality based on these standard types. In the first step of the wizard,
you select the dimension to which you want to apply the changes, and on the next page of
the wizard you’ll map the attribute values to standard business types. The wizard suggests
matches as well. After you complete the mapping, the wizard displays a confirmation page
showing the suggested changes, and you can confirm by clicking Finish, or you can click Back
and make any changes. Confirming will result in the Type property for the mapped attributes being set to the values that you previously selected.

Using the Specify Attribute Ordering option allows you to revise the default dimensional
attribute ordering property value. The possible values are (by) Name or (by) Key. As with
other options, you select the dimension and then update the dimension attributes that you
want to change. After you’ve made your selections, the final wizard page reflects the changes
in the attribute ordering property value. This property is named OrderBy.

Using the Create A Custom Member Formula option allows you to identify any source column as a custom rollup column. That will result in replacing the default aggregation for any
defined attributes in the selected dimension with the values in the source column. It’s the
most flexible method of overriding the default sum aggregation. To use this option, you’ll
first select the dimension with which you want to work. Next you’ll map at least one attribute
to its source column. The wizard configures the attribute property CustomRollupColumn to
the value of the source column that you selected.

Currency Conversions
Using the Define Currency Conversion option allows you to associate source data with currency conversions needed to satisfy your business requirements. Running the wizard results
in a generated MDX script and generated property value configurations. As with the results
of selecting the Define Time Intelligence option, the generated MDX script for the Define
Currency Conversion option creates a calculated member for the requested conversions.
This script is much more complex than the one that the Define Time Intelligence option
generates.

There are some prerequisite structures you must implement prior to running the wizard.
They include at least one currency dimension, at least one time dimension, and at least one
rate measure group. For more detail on the required structure for those dimensions, see the
SQL Server Books Online topic “Currency Conversions.” Also, as shown in Figure 8-22, we’ve
selected the source tables involved in currency conversion (at least for Internet Sales) and
created a data source view (DSV) in BIDS so that you can visualize the relationship between
the source data. Note the indirect relationship between the Internet Sales Facts and Currency
tables via the key relationship in the Currency Rate Facts table in Figure 8-22. The Currency
Rate Facts table relates the Date table to the currency so that the value of currency on a par-
ticular date can be retrieved.

Start the Business Intelligence Wizard and select Define Currency Conversion on the first
page. On the next page of the wizard, you’ll be asked to make three choices. First, you select
the measure group that contains the exchange rate. Using our Adventure Works sample
cube, that measure group is named Exchange Rates.

Figure 8-22 Using the Define Currency Conversion option requires a particular set of source tables in the dimensions that are affected.

Next you select the pivot (or translating) currency, and then select the method you’ve used to
define the exchange rate. Figure 8-23 shows this wizard page. Continuing our example, we’ve
defined the pivot currency as US dollars. We’ve further specified that the exchange rates have
been entered as x Australian dollars per each US dollar.

On the Select Members page of the wizard, you select which members of the measures
dimension you’d like to have converted. As you configure this option, the Exchange Rate
measure values that you can select from are based on the table columns that are available from the measure group that you just selected. In our example, the measure group is
Exchange Rates. The source table is Fact Currency Rate. The source columns from that table
are AverageRate or EndOfDateRate. You can see these options in Figure 8-24. Note that you
can select different exchange rate measure values for each selected measure.

If you take a minute to consider the possibilities, you’ll understand that this wizard is powerful. For example, it can accommodate sophisticated business scenarios, such as the following: Freight cost values should be translated based on end-of-day currency conversion rates, while Internet sales values should be translated based on average currency conversion rates.

Figure 8-23 On the Set Currency Conversion Options page, you select the pivot and value currencies.

Figure 8-24 On the Select Members page, you define the members and method for currency translation.

You’ll also note by looking at Figure 8-24 that you can apply currency conversions to members of dimensions other than the measures dimension. This functionality is particularly
useful if your requirements include modeling accounting source data that is expressed in
multiple currencies.

On the next page of the wizard, you’re presented with three options for performing the
currency localization: many-to-many, many-to-one, or one-to-many. Although there is a
description of each option on the wizard page, we’ll summarize what the options do here:

■■ Many-to-many The source currencies are stored in their local (or original) formats, and translated using the pivot currency into multiple destination (or reporting) formats. An example would be to translate currencies A, B, and C using the pivot currency (US dollars) into multiple destination currencies D, E, and F.
■■ Many-to-one The source currencies are stored in their local (or original) formats, and translated using the pivot currency. All sources use the pivot currency value as the destination (or reporting) value. An example would be to translate currencies A, B, and C all into US dollars.
■■ One-to-many The source currency is stored using the pivot currency. It’s translated
into many reporting currencies. An example would be that all sources store values as US
dollars and translate US dollars into multiple destination currencies.

After you select the cardinality, you have two more choices in this wizard. You must identify
whether a column in the fact table or an attribute value in a dimension should be used to
identify localized currency values. The last step is to select your reporting (or destination)
currencies. The wizard now has enough information to generate an MDX script to execute
your currency calculation. As mentioned, the wizard will also configure a couple of property
values, such as Destination Currency Code (type). You can see the property changes and the
MDX script in the final step of the wizard. If you take a minute to review the resultant MDX
script, you can see that it’s relatively complex. It includes the MDX SCOPE keyword, along
with several other MDX functions. You might also remember that after you click Finish to
confirm the creation of the currency conversion, this script is added to the Calculations tab of
the cube designer so that you can review or update it as needed.
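
To give you a feel for the pattern (the generated script is considerably longer), here is a heavily simplified, hypothetical fragment in the style of a many-to-one conversion. It divides each leaf-level local value by that day’s average exchange rate to express it in the US dollar pivot currency; the real script also deals with reporting currencies, missing rates, and members above the leaf level:

SCOPE ({ [Measures].[Internet Sales Amount] });
    // Convert local-currency values to the US dollar pivot currency at
    // the leaf (day) level; higher levels then aggregate converted data.
    SCOPE (Leaves([Date]));
        THIS = [Measures].[Internet Sales Amount] / [Measures].[Average Rate];
    END SCOPE;
END SCOPE;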

Advanced Cube and Dimension Properties


After you’ve completed your OLAP cube by adding translations, perspectives, actions, KPIs, and calculations, and by adding business intelligence enhancements with the wizard, you’re ready to build and deploy it. During
development, you usually just use all the default cube processing options so that you can
quickly test and view the results. However, when you move to production, you’ll probably
want to use at least some of the myriad possible configuration settings. The reason for this
is that during development you’ll often just use a small subset of data when processing your
cube. In this situation, cube processing times will probably be short (usually in the minutes or
even seconds). Also, probably no one other than the developers will be accessing the cube,
so if the cube is unavailable for browsing during frequent (test) processing, few people will be
concerned about this.

This situation will change when you deploy your cube to production. There you could be
working with massive amounts of data (some clients have cubes in the multi-TB range), as
well as hundreds or even millions of end users who expect nearly constant access to the
cubes. In these situations, you tune the dimension and cube processing settings. For this reason, we turn our attention to how to do just that in the next chapter.

Tip We suggest that you download a free tool from CodePlex called BIDS Helper, which can
be found at http://www.codeplex.com/bidshelper. This tool contains a number of useful utilities
that we use to help us with advanced cube design techniques. The utilities include the following:
Aggregation Manager, Calculation Helpers, Column Usage Reports, Delete Unused Aggregations,
Deploy Aggregation Designs, Deploy MDX Script, Dimension Data Type Discrepancy Check,
Dimension Health Check, Dimension Optimization Report, Measure Group Health Check, Non-
Default Properties Report, Printer Friendly Dimension Usage, Printer Friendly Aggregations,
Tri-State Perspectives, Similar Aggregations, Smart Diff, Show Extra Properties, Update Estimated
Counts, Validate Aggregations, and Visualize Attribute Lattice.

Summary
We’ve covered quite a bit of the functionality available for building OLAP cubes in BIDS, and
we did this in two (albeit long) chapters! Did you notice that we talked about concepts for six
chapters and implementation for only two chapters? A key point to remember from this book is that BIDS is easy to use if you fully understand OLAP concepts before you start developing your BI solutions. We’ve repeatedly seen projects in trouble because of
the (natural) tendency to just open BIDS and get on with it. If we’ve done our job, we’ve given
you enough information to develop a very useful OLAP cube by now.

We still have more to cover, though. In the next chapter, we’ll dive into the world of cube
processing. After that, we’ll look at MDX in detail. Still later, we’ll get to data mining. You have
covered a good bit of ground already, and the basics of OLAP are under your belt at this
point!
Chapter 9
Processing Cubes and Dimensions
Now that you have developed and refined an OLAP cube, it’s time to learn how to build and
process Microsoft SQL Server 2008 Analysis Services (SSAS) objects and deploy them from
the Business Intelligence Development Studio (BIDS) environment to the SSAS server. To
enable you to understand what steps are involved in cube and dimension processing, we’ll
first define the two different types of information that you’ll work with in creating OLAP
cubes: metadata and cube data. Then we’ll explain aggregations and examine the role that
XMLA plays in creating OLAP cubes and dimensions. We’ll close the chapter with a detailed
look at the possible cube data storage modes: multidimensional OLAP (MOLAP), hybrid
OLAP (HOLAP), relational OLAP (ROLAP), or a combination of all three.

Building, Processing, and Deploying OLAP Cubes


After you’ve completed your OLAP cube by optionally adding translations, perspectives,
actions, KPIs, calculations, and business intelligence logic, you’re ready to build and deploy it.
During development, you typically use the default cube processing options so that you can
quickly test and view the results. However, when you move to production, you probably want
to use at least some of the myriad possible configuration settings. The reason for this is that
during development you’ll often just use a small subset of data when processing your cube.
In this situation, cube processing times will probably be short (usually measured in minutes or
even seconds). Also, probably no one other than the developers will be accessing the cube,
so if the cube is unavailable for browsing during frequent full (test) cube processing, few people will be concerned.

The situation changes when you deploy your cube to production. There you could be working with massive amounts of data (some clients have cubes in the multiterabyte range) and hundreds or even millions of users who expect nearly constant access to the cubes. To prepare for these situations, you must tune the dimension and cube processing settings.

BIDS SSAS database projects contain one or more cubes. Each cube references one or more
dimensions. Dimensions are often shared by one or more cubes in the same project. The
type of object or objects you choose to process—that is, all objects in a particular solution, a
particular cube, all cubes, a particular dimension, all dimensions, and so on—will vary in production environments depending on your business requirements.

The most typical scenario we encounter is some sort of cube process (usually an update that
adds only new records) that is run on a nightly basis. This type of cube update automatically includes all dimensions associated with that cube (or those cubes). Because the available options for processing individual dimensions are similar to those available for a cube—for
example, Process Full or Process Data—we’ll focus in this chapter on cube processing options.
Be aware that most of the options we present are available for processing one or more
dimensions as well. The most common business reason we’ve encountered for processing
single dimensions separate from entire cubes is size—that is, the number of rows in source
tables and the frequency of updates in dimension tables. An example of this is near real-time
reporting by customer of shipping activity by a major global shipping company.

Your choices for deploying and processing changes depend on how you’re working in BIDS.
If you’re working in a disconnected environment with a new cube or dimension object, you
have the following choices for the solution: build, rebuild, deploy, or process. If you’re working with a connected object, you need only to process cubes or dimensions. In the latter case,
there is no need to build or deploy because the object or objects you’re working on have
already been built and deployed to the SSAS Service at least once prior to your current work
session.

Differentiating Data and Metadata


As we begin our discussion of SSAS object processing, we need to review whether the information that makes up the objects is considered data or metadata. We are reviewing this concept because it is important that you understand this difference when you are making cube or dimension processing mode choices.

The simplest way to grasp this concept is as follows: the rows in the fact table are data; everything else is metadata. For most developers, it’s easy to understand that the XMLA
that defines the dimensions, levels, hierarchies, and cube structure is metadata. What is
trickier to grasp is that the rows of information in the dimension tables are also metadata. For
example, in a Customer dimension, the name of the table, names of the columns, values in
the columns, and so on are metadata. These partially define the structure of the cube.

But how big is the side (or dimension) of the cube that is defined by the Customer dimension? This can only be determined by the number of rows of data in the customer source table. For this reason, these rows are part of the metadata for the cube. Another consideration is that in most cubes the physical size (that is, number of rows) in the fact table is larger by an order of magnitude than the size of any one dimension table. To understand this concept, think of the example of customers and sales amount for each one. In most businesses, repeat customers cause the customer source table rows to have a one-to-many relationship with the sales (item instance) rows in the fact table.

Fact tables can be huge, which is why they can be divided into physical sections called partitions. We’ll explore logical and physical partitioning later in this chapter. At this point, it’s important that you understand that the data (rows) in the fact table have different processing options available than the rows in the dimension tables—the latter being considered metadata by SSAS.

Working in a Disconnected Environment


We’ll start by taking a look at the disconnected instance. You can build or rebuild an SSAS
project by right-clicking on the project in the Solution Explorer window or by selecting the
Build menu on the toolbar. What does build or rebuild do? In the BIDS environment, build
means that the XMLA metadata that you’ve generated using the tools, designers, and wizards
is validated against a series of rules and schemas.

There are two types of error conditions in BIDS. One is indicated by a blue squiggly line and
indicates that a violation of the Analysis Management Objects (AMO) best practice design
guidelines has occurred. AMO design guidelines are a new feature in SQL Server 2008. The
other type of build error is indicated by a red squiggly line. This type of error indicates a fatal
error and your solution will not build successfully until after you correct any and all of these
errors.

To view or change the default error conditions, right-click on the solution name and then
click Edit Database. You see a list of all defined error conditions on the Warnings tab, as
shown in Figure 9-1.

Figure 9-1 The Warnings tab in BIDS lists all defined design error conditions.

Blue errors are guidelines only. You can correct these errors or choose to ignore them, and
your solution will still build successfully. As we noted, red errors are fatal errors. As with
other types of fatal build errors (for example, those in .NET), Microsoft provides tooltips and
detailed error messages to help you correct these errors. When you attempt to build, both
types of errors will be displayed in the Error List window. If you click a specific error in the
error list, the particular designer in BIDS where you can fix the error will open, such as a particular dimension in the dimension designer. A successful build results when all the information can be validated against rules and schemas. There is no compile step in BIDS.

Note The Rebuild option attempts to validate the metadata that has changed since the last successful build. It is used when you want to more quickly validate (or build) changes you’ve made to an existing project.

After you’ve successfully built your project in a disconnected BIDS instance, you have two
additional steps: process and deploy. The Deploy option is available only when you right-
click the solution name in Solution Explorer. If you select Deploy, the process step is run
automatically with the default processing options. When you select this option, all objects
(cubes, dimensions) are deployed to the server and then processed. We’ll get into details on
exactly what happens during the process step shortly. Deployment progress is reflected in a
window with that same name. At this point, we’ll summarize the deploy step by saying that
all metadata is sent to the server and data is prepared (processed) and then loaded into a
particular SSAS instance. This action results in objects—that is, dimensions and cubes—being
created, and then those objects are loaded with data. The name and location of the target
SSAS server and database instance is configured in the project’s property pages. Note the
order in which the objects are processed and loaded—all dimensions are loaded first. After
all dimensions are successfully loaded, the measure group information is loaded. You should
also note that each measure group is loaded separately. It’s also important to see that in the
Deployment Progress window each step includes a start and end time. In addition to viewing this information in the Deployment Progress window, you can also save this deployment
information to a log file for future analysis. You can capture the processing information to
a log file by enabling the Log\Flight Recorder\Enabled option in SSMS under the Analysis
Server properties. This option captures activities, except queries, on SSAS. To enable query
activity logging, you can either enable a SQL Server Profiler trace or enable Query Logging
on SSAS by changing the Log\QueryLog server properties using SSMS.

It is only after you’ve successfully built, processed, and deployed your cube that you can browse it in the BIDS cube browser. Of course, end users will access the cube via various client tools, such as Microsoft Office Excel or a custom client. Subsequent processing will overwrite objects on the server.

Understand that, during processing, dimensions or entire cubes can be unavailable for end
user browsing. The different process modes—such as full, incremental, and so on—determine
this. From a high level, the Full Process option causes the underlying object (cube or dimension) to be unavailable for browsing, because this type of process is re-creating the object
structure, along with repopulating its data. Because of this limitation, there are many ways for
you to tune dimension and cube processing. We will examine the varying levels of granularity
available for object processing in more detail as we progress through this chapter.

During development, it’s typical to simply deploy prototype and pilot cubes quickly using the
default settings. As mentioned, this type of activity usually completes within a couple of minutes at most. During default processing, objects are overwritten and the cube is not available
for browsing.

As mentioned, during production, you’ll want to have much finer control over SSAS object
processing and deployment. To help you understand your configuration options, we’ll first
take a closer look at what SSAS considers metadata and what it considers to be data. The reason for this is that each of these items has its own set of processing and deployment settings.

Metadata and data have separate, configurable processing settings. Before we leave the subject of metadata and data, we have to consider one additional type of data that affects how we process our cube. This last type of data is called aggregations.

Working in a Connected Environment


As mentioned, when working with SSAS databases in BIDS in a connected mode, the Build,
Rebuild, and Deploy options are not available. Your only choice is to process the changes
that you’ve made to the XMLA (metadata). Why is this? You’ll recall that the Build command
causes SSAS to validate the information you’ve updated in BIDS against the internal schema
to make sure that all updates are valid. When you’re working in connected mode, this validation occurs in real time. That is, if you attempt to make an invalid change to the object structure, either a blue squiggly line (warning) or a red squiggly line (error) appears at the location
of the invalid update after you save the file or files involved. There is no deploy step available
when you are working in connected mode because you’re working with the live metadata
files. Obviously, when working in this mode it’s quite important to refrain from making
breaking changes to live objects. We use connected mode only on development servers for
this reason.

The process step is still available in connected mode, because if you want to make a change
to the data rather than the metadata, you must elect to process that data by executing a
Process command (for the cube or for the dimension). We’ll be taking a closer look at the
available processing mode options later in this chapter.

Understanding Aggregations
As we begin to understand the implications of cube processing options, let’s explore a bit
more about the concept of an OLAP aggregation. What exactly is an aggregation? It’s a pre-aggregated (usually summed), stored value. Remember that data is loaded into the cube from the rows in the fact table. These rows are loaded into the source fact table from the various source systems at the level of granularity (or detail) defined in the grain statements. For
example, it’s typical to load the fact rows at some time granularity. For some clients, we’ve
loaded at the day level—that is, sales per day; for others, we’ve loaded at the minute level.

In some ways, an OLAP aggregation is similar to an index of a calculated column of a relational table—that is, the index causes the results of the calculations to be stored on disk, rather than the engine having to calculate them each time they are requested. The difference, of course, is that OLAP cubes are multidimensional. So another way to think of an aggregation is as a stored, saved intersection of aggregated fact table values. For the purposes of processing, aggregations are considered data (rather than metadata). So, the data in a cube includes the source fact table rows and any defined aggregations.

Another aspect of aggregations is that the SSAS query engine can use intermediate aggregations to process queries. For example, suppose that sales data is loaded into the cube from the fact table at the day level and that you add aggregations at the month level to sum the sales amounts. A query at the year level could then use the month-level aggregations rather
than having to retrieve the row-level (daily) data from the cube. Because of this, including
appropriate aggregations is extremely important to optimal production query performance.
Of course, aggregations are not free. Adding aggregations requires more storage space on
disk and increases cube processing time. In fact, creating too many aggregations (which is
known as overaggregation) is a common mistake that we see. Overaggregation results in long
cube processing times and higher disk storage requirements without producing noticeable
query response performance gains.
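
To return to the day, month, and year example above: a year-level query such as the following (using Adventure Works names) can be answered from month-level aggregations, if they exist, rather than by summing the day-level fact data at query time:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS
FROM [Adventure Works]
WHERE ([Date].[Calendar].[Calendar Year].&[2004]);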

Because of the importance of appropriate aggregation, Microsoft has added a new tab to
the cube designer in SQL Server 2008 to help you to view and refine aggregations. This tab is
shown in Figure 9-2.

Figure 9-2 The BIDS cube designer includes a new Aggregations tab.

Figure 9-2 shows the default aggregation design for the Adventure Works cube. By default,
there are no aggregations designed when you create a new BI project. You might be wondering why. The reason is that the SSAS query engine is highly optimized and the underlying cube structure is designed for fast querying, so it may be acceptable to deploy production cubes with no aggregations. You’ll remember that cube processing times are increased
when you add aggregations. So the default is 0 (zero) aggregations. You use this setting so
that the cube processing time is the absolute fastest that it can be.

For some smaller clients, we have found that this default setting results in acceptable query
performance with very fast cube processing times. Of course, there is data latency in this
scenario because new data is introduced only into the cube when it is processed. We’ve seen
many clients who work with the default processing settings because they are simple and
nearly automatic (that is, no custom configuration is required), and because the data update
cycle fits with their business requirements. Typically, the cube is updated with new data once
nightly when no one is connecting to it because the business runs only during a working day
(that is, from 9 A.M. to 5 P.M.).

Even though small-sized cubes can work with no aggregations, we find that the majority of
our clients choose to add some amount of aggregation to their cubes. Some of these sce-
narios include the following:

■■ Medium-sized cubes—20 GB or more—with complex designs, complex MDX queries, or a large number of users executing queries simultaneously
■■ Very large, or even huge, cubes—defined as 100 GB or more
■■ Demand for cube availability 24 hours a day, 7 days a week—leaving very small mainte-
nance windows
■■ Requirement for minimal data latency—in hours, minutes, or even seconds

For more conceptual information about OLAP aggregations, see the SQL Server Books Online
topic “Aggregations and Aggregation Designs.” Closely related to the topic of aggregations is
that of partitions. You’ll need to understand the concepts and implementation of cube parti-
tions before exploring the mechanics of aggregation design. After we look at partitions, we’ll
return to the topic of aggregations, learning how to create, implement, and tune custom
aggregation designs.

We’ll have more to say about working with aggregations later in this chapter (in the
“Implementing Aggregations” section). To understand how to do this, you’ll first have to
understand another concept—cube partitioning. We’ll take a closer look at that next.

Partitioning
A partition is a logical and physical segmentation of the data in a cube. Recall that OLAP data
is defined as detail rows and any associated aggregations for data retrieved from a particular
source fact table. Partitions are created to make management and maintenance of cube data
easier. Importantly, partitions can be processed and queried in parallel. Here are some more
specific examples:

■■ Scalability Partitions can be located in different physical locations, even on different physical servers.
■■ Availability Configuring processing settings on a per-partition basis can speed up
cube processing. Cubes can be unavailable for browsing during processing, depending
on the type of processing that is being done. (We’ll say more about this topic later in
this chapter.)
■■ Reducing storage space Each partition can have a unique aggregation design. You’ll
recall that aggregations are stored on disk, so you might choose one aggregation
design for a partition to minimize disk usage, and a different design on another parti-
tion to maximize performance.

■■ Simplifying backup In many scenarios, only new data is loaded (that is, no changes
are allowed to existing data). In this situation, often the partition with the latest data is
backed up more frequently than partitions containing historical data.

If you look at the disconnected example of the Adventure Works cube designer Partitions
tab shown in Figure 9-3, you can see that we have one or more partitions for each measure
group included in the cube. By default, a single partition is created for each measure group.

Each of the sample cube partitions is assigned a storage type of MOLAP (for multidimen-
sional OLAP storage, which we’ll cover next) and has zero associated aggregations. We’ll work
with the Internet Sales measure group to examine the existing four partitions for this mea-
sure group in this sample cube.

Figure 9-3 The Partitions tab shows existing OLAP cube partitions

To examine a partition for a cube, you’ll need to click on the Source column on the Partitions
tab of the cube designer. Doing this reveals a build (…) button. When you click it, the Partition
Source dialog box opens. There, you see that the binding type is set to Query Binding. If
you wanted to create a new partition, you could bind to a query or to a table. Reviewing the
value of the first partition’s Query Binding, you see that the query splits the source data by
using a WHERE clause in the Transact-SQL query. This is shown in Figure 9-4.

Partitions are defined in a language that is understood by the source system. For example, if
you’re using SQL Server as source fact table data, the partition query is written in Transact-
SQL. The most common method of slicing or partitioning source fact tables is by filtering
(using a WHERE condition) on a time-based column. We usually partition on a week or month
value. This results in either 52 partitions (for weeks) or 12 partitions (for months) for each
year’s worth of data stored in the fact table. Of course, you can use any partitioning scheme
that suits your business needs.
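If you script your SSAS environment rather than using the designer, query-bound partitions like these can also be created with AMO (Analysis Management Objects) from C#. The following is only a minimal sketch under stated assumptions: the server, database, cube, measure group, fact table, and OrderDateKey column names are hypothetical placeholders, and the measure group's original full-table partition would need to be re-scoped or removed so that the new monthly slices do not overlap it.

using Microsoft.AnalysisServices;   // AMO

class MonthlyPartitions
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");

        Database db = server.Databases.FindByName("Adventure Works DW 2008");
        MeasureGroup mg = db.Cubes.FindByName("Adventure Works")
                            .MeasureGroups.FindByName("Internet Sales");
        string dataSourceId = db.DataSources[0].ID;

        // Each partition is bound to a query that slices the fact table by month.
        // The WHERE ranges must not overlap, or duplicate fact rows will be loaded.
        AddPartition(mg, dataSourceId, "Internet_Sales_200312", 20031201, 20040101);
        AddPartition(mg, dataSourceId, "Internet_Sales_200401", 20040101, 20040201);

        mg.Update(UpdateOptions.ExpandFull);   // send the new definitions to the server
        server.Disconnect();
    }

    static void AddPartition(MeasureGroup mg, string dataSourceId,
                             string name, int fromKey, int toKey)
    {
        Partition partition = mg.Partitions.Add(name);
        partition.StorageMode = StorageMode.Molap;   // the default storage type
        partition.Source = new QueryBinding(dataSourceId, string.Format(
            "SELECT * FROM dbo.FactInternetSales " +
            "WHERE OrderDateKey >= {0} AND OrderDateKey < {1}", fromKey, toKey));
    }
}

Just as when they are created through the wizard, the new partitions still have to be processed before they return any data.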

After you’ve split an original partition, you can define additional new partitions. You can
easily see which partitions are defined on a query (rather than on an entire table) on the
Partitions tab of BIDS. The source value shows the query used rather than the table name, as
you can see in Figure 9-5.

Figure 9-4 Transact-SQL query with the WHERE clause

Figure 9-5 The Source column of a partition reflects the query upon which its definition is based.

To define additional partitions, click the New Partition link under the particular partition that
you want to split. Doing this will start the Partition Wizard.

In the first step of the Partition Wizard, you are asked to select the source measure group
and source table on which to define the partition. Next you enter the query to restrict the
source rows.

Note Verify each query you use to define a partition. Although the wizard validates syntax, it
does not check for duplicate data. If your query does not cleanly split the source data, you run
the risk of loading your cube with duplicate data from the source fact table.

In the next step, you select the processing and physical location for the partition. The avail-
able options are shown in Figure 9-6. In the Processing Location section, you can choose
the Current Server Instance option (the default) or the Remote Analysis Services Data
Source option. The Storage Location section enables you to choose where the partition will
be stored. You can select either the Default Server Location option (and specify a default
location in the accompanying text box) or the Specified Folder option (and, again, specify
the location in the accompanying text box). Any nondefault folder has to be set up as an
approved storage folder on the server ahead of time.

Figure 9-6 When defining OLAP partitions, you can change the physical storage location of the new
partition.

After you’ve defined the partitioning query to restrict the rows and you’ve picked the pro-
cessing and storage locations, in this same wizard you can then design aggregation schemes
for the new partition. As mentioned, we’ll return to the subject of custom aggregation design
later in this chapter. You can also optionally process and deploy the new partition at the
completion of the wizard.

The next step in the custom processing configuration can take one of two directions. You can
begin to work with the three possible storage modes for data and metadata: MOLAP, HOLAP,
or ROLAP. Or you can design aggregations. We’ll select the former as the next phase of our
tour and move on to an explanation of physical storage mode options.

Choosing Storage Modes: MOLAP, HOLAP, and ROLAP


A next step in cube processing customization is to define the storage method for a particu-
lar partition. You’ll recall that, by default, each fact table creates exactly one partition with a
storage type of MOLAP. There are three possible types of storage for a particular partition:
MOLAP, HOLAP, and ROLAP.

MOLAP
In multidimensional OLAP (MOLAP), a copy of the fact table rows (or facts) is stored in a
format native to SSAS. MOLAP is not a one-for-one storage option. Because of the efficient
storage mechanisms used for cubes, storage requirements are approximately 10 to 20 per-
cent of the size of the original data. For example, if you have 1 GB in your fact table, plan
for around 200 MB of storage on SSAS. Regardless of the high level of efficiency when using
MOLAP, be aware that you are choosing to make a copy of all source data. In addition, any
and all aggregations that you design are stored in the native SSAS format. The more aggre-
gations that are designed, the greater the processing time for the partition and the more
physical storage space is needed. We occasionally use storage options other than MOLAP
specifically to reduce partition processing times.

MOLAP is the default storage option because it results in the fastest query performance. The
reason for this is that the SSAS query engine is optimized to read data from a multidimen-
sional source store. We find that it’s typical to use MOLAP for the majority of the partitions in
a cube. We usually add some aggregations to each partition. We’ll discuss adding aggrega-
tions in the next topic of this chapter.

HOLAP
Hybrid OLAP (HOLAP) does not make a copy of the source fact table rows in SSAS. It reads
this information from the star schema source. Any aggregations that you add are stored in
the native SSAS format on the storage location defined for the SSAS instance. This storage
mode results in a reduction of storage space needed. This option is often used for parti-
tions that contain infrequently queried historical data. Because aggregations are stored in
the native format and result in fast query response time, it’s typical to design a slightly larger
number of aggregations in this scenario.

ROLAP
Relational OLAP (ROLAP) does not make a copy of the facts on SSAS. It reads this information
from the star schema source. Any aggregations that are designed are written back to tables
on the same star schema source system. Query performance is significantly slower than that
of partitions using MOLAP or HOLAP; however, particular business scenarios can be well served by using ROLAP partitions. We'll discuss these situations in more detail later in this
chapter. These include the following:

■■ Huge amounts of source data, such as cubes that are many TBs in size
■■ Need for near real-time data—for example, latency in seconds
■■ Need for near 100 percent cube availability—for example, downtime because of pro-
cessing limited to minutes or seconds

Figure 9-7 shows a conceptual view of ROLAP aggregations. Note that additional tables are
created in the OLTP source RDBMS. Note also that the column names reflect the positions in
the hierarchy that are being aggregated and the type of aggregation performed. Also, it is
interesting to consider that these aggregation tables can be queried in Transact-SQL, rather
than MDX, because they are stored in a relational format.

Figure 9-7 Conceptual view of ROLAP aggregations, shown in aggTable1

OLTP Table Partitioning


If your star schema source data is stored in the Enterprise edition of SQL Server 2005 or later,
you have the ability to do relational table partitioning in your star schema source tables.

Note RDBMS systems other than SQL Server also support relational table partitioning. If your
source system or systems support this type of partitioning, you should consider implementing
this feature as part of your BI project for easier maintainability and management.

An OLTP table partitioning strategy can complement any partitioning you choose using SSAS
(that is, cube partitions), or you can choose to partition only on the relational side. You’ll need
to decide which type (or types) of partitioning is appropriate for your BI solution.

Table partitioning is the ability to position data from the same table on different physical
locations (disks) while having that data appear to continue to originate from the same logical
table from the end user’s perspective. This simplifies management of very large databases
(VLDBs)—in particular, management of very large tables. The large tables we’re concerned
about here are, of course, fact tables. It’s not uncommon for fact tables to contain millions or
tens of millions of rows. In fact, support for especially huge (over four billion rows) fact tables
is one of the reasons that the data type BIGINT was introduced in SQL Server 2005. Relational
table partitioning can simplify administrative tasks and general management of these, often
large or even huge, data sources. For example, backups can be performed much more effi-
ciently on table partitions than on entire (huge) fact tables. Although relational table parti-
tioning is relatively simple, several steps are involved in implementing it. Here’s a conceptual
overview of the technique (a hedged sketch of the key statements follows the numbered list):

1. Identify the tables that are the best candidates for partitioning. For OLAP projects, as
mentioned, this will generally be the fact tables.
2. Identify the value (or column) to be used for partitioning. This is usually a date field. A
constraint must be implemented on this column of the tables that will participate in
partitioning.
3. Implement the physical architecture needed to support partitioning—that is, install the
physical disks.
4. Create file groups in the database for each of the new physical disks or arrays.
5. Create .ndf files (or secondary database files) for the SQL Server 2005 (or later) database
where the tables to be partitioned are contained, and associate these .ndf files with the
file groups you created in step 4.
6. Create a partition function. Doing this creates the buckets to distribute the sections of
the table into. The sections are most often created by date range—that is, from xxx to
yyy date, usually monthly or annually.
7. Create a partition scheme. Doing this associates the buckets you created previously
with a list of file groups, one file group for each time period, such as month or year.
8. Create the table (usually the fact table) on the partition scheme that you created earlier.
Doing this splits the table into the buckets you’ve created.
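As a rough illustration of steps 6 through 8, the statements below show one possible shape of the Transact-SQL involved, submitted here from a small C# console program. Everything in it is a hypothetical placeholder: the connection string, the FG2003/FG2004/FGCurrent filegroups (which steps 3 through 5 are assumed to have created already), and the fact table and column names.

using System.Data.SqlClient;

class CreatePartitionedFactTable
{
    static void Main()
    {
        string[] ddl =
        {
            // Step 6: the partition function defines the date "buckets."
            "CREATE PARTITION FUNCTION pfOrderDate (int) " +
            "AS RANGE RIGHT FOR VALUES (20040101, 20050101);",

            // Step 7: the partition scheme maps each bucket to a filegroup.
            "CREATE PARTITION SCHEME psOrderDate " +
            "AS PARTITION pfOrderDate TO (FG2003, FG2004, FGCurrent);",

            // Step 8: the fact table is created on the partition scheme, so its rows
            // are distributed into the buckets by the OrderDateKey value.
            "CREATE TABLE dbo.FactInternetSales " +
            "(OrderDateKey int NOT NULL, ProductKey int NOT NULL, " +
            " CustomerKey int NOT NULL, SalesAmount money NOT NULL) " +
            "ON psOrderDate (OrderDateKey);"
        };

        using (SqlConnection connection = new SqlConnection(
            "Data Source=localhost;Initial Catalog=AdventureWorksDW2008;Integrated Security=SSPI"))
        {
            connection.Open();
            foreach (string statement in ddl)
            {
                using (SqlCommand command = new SqlCommand(statement, connection))
                {
                    command.ExecuteNonQuery();
                }
            }
        }
    }
}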

Note If your source data is stored in the Enterprise edition of SQL Server 2008 and on a server
with multiple CPUs, you can also take advantage of enhancements in parallel processing of fact
table partitions that are the result of changes in the query optimizer.

Other OLAP Partition Configurations


One other consideration in the world of partitions returns us to SSAS cube partitions. Here it
is possible to define cube partitions as local (the default) or remote. You define this location
when you create the partition using the Partition Wizard shown previously in Figure 9-6, or
you can manually configure the partition properties after you create it using its Properties
dialog box. The primary reason to consider using remote partitions is to do a kind of load
balancing in the SSAS environment. You use remote partitions to implement load balancing
in situations where your primary SSAS server is stressed (usually) because a large number of
users are executing complex queries. By using remote partitions, you can split the process-
ing work across multiple physical servers. There are also other things you must consider
when using remote partitions. Remote partitions can use MOLAP, HOLAP, or ROLAP storage.
Remote partitions store some information on both the local server and the remote servers.

If you’re using remote MOLAP, data and aggregations for the remote partition are stored on
the remote server. If you’re using remote HOLAP, aggregations for the remote partition are
stored on the remote server while data is read from the OLTP source. If you’re using remote
ROLAP, nothing is stored on the remote server; both data and aggregations are read from
the OLTP source.

Before we review the details of how and why to change a partition from the default storage
mode of MOLAP to HOLAP or ROLAP, let’s return to the topic of aggregations. A key driver
of storage mode change is processing time. Processing time is affected by both the quantity
of source fact table rows and the quantity of defined aggregations. We find there is a need to
balance query response performance with processing time. Most clients prefer the fastest
query response time, even if that means that processing time results in a bit of latency—in
other words, MOLAP with some added aggregations.

Now that you understand storage modes, we'll return to aggregations. In addition to determining how the source data is stored, the storage mode also determines the storage location and type for any aggregations associated with a partition.

Implementing Aggregations
Why is it so important to add the correct amount of aggregations to your cube’s partitions?
As stated previously, it might not be. Keep in mind that some SSAS cubes do not require
aggregations to function acceptably. Similar to the idea that small OLTP databases need no relational indexes, if your cube is quite small (under 5 GB) and you have a small number of end users (100 or fewer), you might not have to add any aggregations at
all. Also, adding aggregations to a particular partition is done for the same reason that you
add indexes to an RDBMS—to speed up query response times. The process to add these
aggregations is only marginally similar to the process of adding indexes to an RDBMS. BIDS includes tools, wizards, and a new Aggregations tab in the cube designer to give you power
and control over the aggregation design process. However, we find that many customers who
have relational database backgrounds are quite confused by the aggregation design process.
Here are some key points to remember:

■■ The core reason to add aggregations to a partition is to improve query response time.
Although you could tune the MDX source query, you’re more likely to add aggrega-
tions. The reason is that the cost of aggregations is relatively small because they’re usu-
ally quick and easy to add, particularly if you’re using MOLAP storage.
■■ Do not overaggregate. For MOLAP, 20 to 30 percent aggregation is usually sufficient.
Heavy—for example, over 50 percent—aggregation can result in unacceptably long
partition processing times. Remember that the SSAS query engine makes use of inter-
mediate aggregations to get query results. For example, for facts loaded at the day
level, aggregated at the month level, and queried at the year level, month-level aggre-
gations are used to answer the query request.
■■ Use the aggregation tools and wizards prior to manually adding aggregations. If source
MDX queries are not improved by adding aggregations recommended by the tools,
consider rewriting MDX queries prior to adding aggregations manually.
■■ Consider the following facts when adding aggregations to a cube: aggregations
increase cube processing times, and aggregations increase the storage space required
for the cube on disk.
■■ The storage type affects the amount of aggregations you'll add. MOLAP storage needs the smallest percentage of aggregations because the source data is available to the SSAS query engine in the native multidimensional format. HOLAP storage
usually requires the largest percentage of aggregations. This is done to preclude the
need for the SSAS query engine to retrieve data from the RDBMS source system.

Note The new AMO design warnings tool generates warnings when you attempt to build an
OLAP cube that includes nonstandard aggregation designs.

In the next few sections, we look closely at the following wizards and tools that help you
design appropriate aggregations: Aggregation Design Wizard, Usage-Based Optimization
Wizard, SQL Server Profiler, and the Advanced view of the aggregations designer.

Aggregation Design Wizard


The Aggregation Design Wizard is available in BIDS (and SQL Server Management Studio).
You access the wizard by clicking on the measure group that contains the partition (or par-
titions) you want to work with and then clicking the Design Aggregations button on the
toolbar on the Aggregations tab in BIDS. This is shown in Figure 9-8. Doing this opens the Aggregation Design Wizard, which asks you to select one or more partitions from those
defined in the measure group that you originally selected.

Figure 9-8 The new Aggregations tab in BIDS allows you to define aggregation schemes for your OLAP cube
partitions.

In the next step of the wizard, you review the default assigned settings for aggregations for
each dimension’s attributes. You have four options: Default, Full, None, and Unrestricted.
We recommend leaving this setting at Default unless you have a specific business reason to
change it. An example of a business reason is a dimensional attribute that is browsed rarely
by a small number of users. In this case, you might choose the None option for a particular
attribute. In particular, we caution you to avoid the Full setting. In fact, if you select it, a warn-
ing dialog box is displayed that cautions you about the potential overhead incurred by using
this setting. If you choose Default, a default rule is applied when you further configure the
aggregation designer. This step, which you complete in the Review Aggregation Usage page
of the Aggregation Design Wizard, is shown in Figure 9-9 for the Internet Sales partition from
the measure group with the same name.

Figure 9-9 The Aggregation Design Wizard allows you to configure the level of desired aggregation for
individual dimensional attributes.

In the next step of this wizard, you have to either enter the number of rows in the partition or
click the Count button to have SSAS count the rows. You need to do this because the default
aggregation rule uses the number of rows as one of the variables in calculating the sug-
gested aggregations. You are provided the option of manually entering the number of rows
so that you can avoid generating the count query and thereby reduce the overhead on the
source database server. After you complete this step and click Next, the wizard presents you
with options for designing the aggregations for the selected partition. You can choose one of
the following four options:

■■ Estimated Storage Reaches With this option, you fill in a number, in MB or GB, and
SSAS designs aggregations that require up to that size limit for storage on disk.
■■ Performance Gain Reaches With this option, you fill in a percentage increase in query
performance speed and SSAS designs aggregations until that threshold is met.
■■ I Click Stop When you choose this option, the wizard stops adding aggregations
when you click Stop.
■■ Do Not Design Aggregations If you select this option, the wizard does not design
any aggregations.

After you select an option and click Start, the wizard presents you with an active chart
that shows you the number of aggregations and the storage space needed for those
aggregations.

If you select the Performance Gain Reaches option on the Set Aggregation Options page, a
good starting value for you to choose is 20 percent. As discussed earlier, you should refrain
from overaggregating the SSAS cube. Overaggregating is defined as having more than 50
percent aggregations. You can see the results of a sample execution in Figure 9-10. For the
sample, we selected the first option, Estimated Storage Reaches. Note that the results tell you
the number of aggregations and the storage space needed for these aggregations. You can
click Stop to halt the aggregation design process.

After the wizard completes its recommendations and you click Next, you can choose whether
to process the partition using those recommendations immediately or save the results for
later processing. This is an important choice because the aggregations that you’ve designed
won’t be created until the cube is processed. You can also apply the newly created aggre-
gation design to other partitions in the cube. You do this by clicking the Assign Aggrega-
tion Design button (the third button from the left) on the toolbar on the Aggregations tab
in BIDS.

Figure 9-10 The Aggregation Design Wizard allows you to choose from four options to design aggregations.
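The aggregation design assignment described above can also be scripted. The AMO sketch below assumes, hypothetically, that the wizard has already saved an aggregation design named InternetSalesAggs for the Internet Sales measure group; all of the object names are placeholders.

using Microsoft.AnalysisServices;

class AssignAggregationDesign
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");

        MeasureGroup mg = server.Databases.FindByName("Adventure Works DW 2008")
            .Cubes.FindByName("Adventure Works")
            .MeasureGroups.FindByName("Internet Sales");
        AggregationDesign design = mg.AggregationDesigns.FindByName("InternetSalesAggs");

        // Point every partition in the measure group at the shared design.
        foreach (Partition partition in mg.Partitions)
        {
            partition.AggregationDesignID = design.ID;
        }
        mg.Update(UpdateOptions.ExpandFull);

        // ProcessIndexes builds the aggregations (and indexes) for partitions whose
        // data is already loaded, without rereading the source fact rows.
        mg.Process(ProcessType.ProcessIndexes);

        server.Disconnect();
    }
}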

Usage-Based Optimization Wizard


The Usage-Based Optimization Wizard works by saving actual queries sent to the Analysis
Services database. The saved queries are based on parameter values that you specify, such
as start and end time, user name, and so on. The wizard then uses an algorithm to figure out
which aggregations will best improve the performance of the queries that are run and that
fall within the configured parameters. Because query performance is determined as much by
the selected (and filtered) queries coming from your client applications as it is by the data,
using the Usage-Based Optimization Wizard effectively is an intelligent approach. By using
this tool, you are causing SSAS to create aggregations specifically for the particular queries,
rather than just using the blanket approach that the Aggregation Design Wizard uses.

There are three SSAS properties you must configure prior to running the wizard. The first is
called the QueryLogConnectionString. You set this value to the database connection string
where you’d like to store the query log table. The data stored in this table will be retrieved by
the Usage-Based Optimization Wizard. (This process is similar to the usage of a trace table
by the database tuning advisor for OLTP relational index optimization.) To set this property,
open SQL Server Management Studio (SSMS), right-click the SSAS instance, and then select
Properties. On the general page, locate the property in the list labeled Log\QueryLog\
QueryLogConnectionString. Click the build (...) button in the value column for this property,
and specify a connection string to a database. If you use Windows Authentication for this
connection, be aware that the connection will be made under the credentials of the SSAS ser-
vice account.

The second property is the CreateQueryLogTable property. Set this to True to have SSAS cre-
ate a table that logs queries. This table will be used to provide queries to the wizard. This
process is similar to using a trace table to provide queries to SQL Server Profiler for relational
database query tuning. You can optionally change the default name of the query log table
for the database you previously defined. This value is set to OlapQueryLog by default and can
be changed by setting the QueryLogTableName property.

The third property to set is QueryLogSampling. The default is to only capture 1 out of 10
queries. You’ll probably want to set this to 1 for your sampling purposes so that every query
within the defined parameter set is captured. However, just like using SQL Server Profiler,
capturing these queries incurs some overhead on the server, so be cautious about sampling
every query on your production servers.

You can configure all of these properties by using the properties window for SSAS inside of
SSMS.
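If you prefer to script these settings instead of setting them interactively, something like the following AMO sketch should work. It assumes that the server property names match the Log\QueryLog\... paths shown in SSMS, and the connection string and catalog name are placeholders you would replace.

using Microsoft.AnalysisServices;

class ConfigureQueryLog
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");

        // Property paths as displayed on the General page of the SSAS instance
        // properties in SSMS; the connection string value is a placeholder.
        server.ServerProperties[@"Log\QueryLog\QueryLogConnectionString"].Value =
            "Provider=SQLNCLI10;Data Source=localhost;" +
            "Initial Catalog=SSASQueryLog;Integrated Security=SSPI";
        server.ServerProperties[@"Log\QueryLog\CreateQueryLogTable"].Value = "true";
        server.ServerProperties[@"Log\QueryLog\QueryLogTableName"].Value = "OlapQueryLog";
        server.ServerProperties[@"Log\QueryLog\QueryLogSampling"].Value = "1"; // capture every query

        server.Update();      // write the changed properties back to the instance
        server.Disconnect();
    }
}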

You can run the wizard by connecting to SSAS in SSMS, right-clicking on a cube partition in
Object Explorer, and then selecting Usage Based Optimization. It can also be run from the
Aggregations tab in the cube designer in BIDS. After you start the wizard, select which parti-
tions you want to work with, and then ask SSAS to design aggregations based on any combi-
nation of the following parameter values: beginning date, ending date, specific user or users,
and quantity of frequent queries by percentage of total. After you’ve configured the previous
settings, the Usage-Based Optimization Wizard presents you with a sample list of queries
to select from. You then select which of these queries you’d like SSAS to design aggrega-
tions for. As you complete the configuration and run the wizard, it will produce a list of sug-
gested aggregations. These can immediately be implemented or be saved as a script for later
execution.

SQL Server Profiler


Another interesting tool you can use to help you design aggregations more intelligently is
SQL Server Profiler. In Chapter 4, “Physical Architecture in Business Intelligence Solutions,” we
looked at SQL Server Profiler’s ability to capture activity on the SSAS server. There we saw
that you could configure the capture, which is called a trace, to capture only specific types of
information. We saw that you can filter for MDX queries and other such items.

In the context of aggregation design, SQL Server Profiler has a couple of uses. The first is
to help you go beyond the results of the Usage-Based Optimization Wizard by creating a
trace that captures MDX queries and then filter the results to find the problematic (that is,
long-running) queries. To improve the query response times for such queries, you can either
attempt to rewrite the MDX statement (or expression) or design specific aggregations. In
most situations, adding aggregations produces improved query results with less effort than
rewriting the MDX query.

If you select Show All Events on the Events Selection tab of the Trace Properties dialog box,
you’ll see that you have a number of advanced capture options in the Query Processing
area. These are shown in Figure 9-11. Note that included in these options is Get Data From
Aggregation. Selecting this option is a very granular way for you to verify that particular MDX
queries are using particular aggregations.

Figure 9-11 Selecting the Show All Events view displays aggregation-specific events.

Note As with RDBMS query tuning, OLAP query tuning is a time-consuming process. We’ve
presented the information in this chapter in the order in which you should use it—that is, use
the wizards and tools prior to undertaking manual query tuning. Also, we caution that query
tuning can only enhance performance—it cannot compensate for poor (or nonexistent) star
schema source design. In most production situations, we don’t need to use any of the advanced
procedures we discuss here because performance is acceptable without any of these additional
processes.

In SQL Server 2008, several enhancements to speed up query execution have been intro-
duced. Some of these enhancements are completely automatic if your queries use the
feature—such as more efficient subspace calculations. In other words, SSAS 2008 divides the
space to separate calculated members, regular members, and empty space. Then it can bet-
ter evaluate cells that need to be included in calculations. Other types of internal optimiza-
tions require meeting a certain set of prerequisites.

After you’ve identified problematic queries and possible aggregations to be added to your
OLAP cube’s partitions, how do you create these specific aggregations? This is best done
via a new capability on the BIDS cube designer Aggregations tab.

Aggregation Designer: Advanced View


To access the capability to manually create either aggregation designs (collections of
aggregations) or individual aggregations, you switch to advanced view in the aggregations
designer in BIDS by clicking on the Advanced View button (the fifth button from the left) on
the toolbar. After selecting a measure group and either creating a new aggregation design
or selecting an existing one, you can create aggregations one by one by clicking on the New
Aggregation button on the same toolbar. Figure 9-12 shows the toolbar of this designer in
the advanced view.


Figure 9-12 The advanced view of the BIDS cube aggregations designer allows you to create individual
aggregations.

In this view, you’ll create aggregation designs. These are groups of individual aggregations.
You can also add, copy, or delete individual aggregations. After you’ve grouped your newly
created aggregations into aggregation designs, you can then assign these named aggrega-
tion designs to one or more cube partitions. The advanced view provides you with a very
granular view of the aggregations that you have designed for particular attributes. We will
use these advanced options as part of query tuning for expensive and frequently accessed
queries.

In addition to designing aggregations at the attribute level, you also have the option of
configuring four advanced properties for each attribute. The advanced designer and the
Properties sheet for the selected attribute are where you would make these configurations.

Although this advanced aggregation designer is really powerful and flexible, we again remind
you that, based on our experience, only advanced developers will use it to improve the per-
formance of a few queries by adding aggregations to affected dimensional attributes.

Implementing Advanced Storage with MOLAP, HOLAP, or ROLAP
When you want to modify the type of storage for a particular cube partition, simply click on
the Partitions tab for that partition in the BIDS cube designer. Alternatively, you can create
new partitions for an existing cube by using this tab as well. Then click the Storage Settings
link below the selected partition. This opens a dialog box that allows you to adjust the stor-
age setting either by sliding the slider bar or setting custom storage options, which you
access by clicking the Options button. Figure 9-13 shows the default setting, MOLAP. As
we’ve discussed, for many partitions the default of MOLAP results in the best query perfor-
mance with acceptable cube availability (because of appropriate partition processing times).
You should change this option only if you have a particular business reason (as discussed ear-
lier in this chapter) for doing so.

Figure 9-13 You configure storage options for each partition in BIDS by using the Measure Group Storage
Settings dialog box.

Although the slider provides a brief explanation of the other settings, you probably need a
more complete explanation to effectively select something other than the default. Note that
the proactive caching feature is enabled for all storage modes other than the default (simple
MOLAP). We’ll cover proactive caching in the next section of this chapter. Here’s an explana-
tion of the impact of each setting in the Measure Group Storage Settings dialog box (a brief AMO sketch of switching a partition’s storage mode follows the list):

■■ MOLAP (default) Source data (fact table rows) is copied from the star schema to the
SSAS instance as MOLAP data. Source metadata (which includes cube and dimension
structure and dimension data) and aggregations are copied (for dimension data) or
generated (for all other metadata and aggregations). The results are stored in MOLAP
format on SSAS, and proactive caching is not used.
■■ MOLAP (nondefault) Source data is copied. Metadata and aggregations are stored
in MOLAP format on SSAS. Proactive caching is enabled. This includes scheduled, auto-
matic, and medium- and low-latency MOLAP.
■■ HOLAP Source data is not copied, metadata and aggregations are stored in MOLAP
format on SSAS, and proactive caching is enabled.
■■ ROLAP This option is for cubes. Source data is not copied. Metadata is stored in
MOLAP format on SSAS. Aggregations are stored in the star schema database. For
dimensions, metadata is not copied; it is simply read from the star schema database
table or tables. Proactive caching is enabled.
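As promised above, here is a minimal AMO sketch of switching a single partition away from the default storage mode, as the slider in the dialog box would. The object names are hypothetical placeholders, and the partition must be reprocessed after the change.

using Microsoft.AnalysisServices;

class SwitchStorageMode
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");

        Partition partition = server.Databases.FindByName("Adventure Works DW 2008")
            .Cubes.FindByName("Adventure Works")
            .MeasureGroups.FindByName("Internet Sales")
            .Partitions.FindByName("Internet_Sales_2001");

        partition.StorageMode = StorageMode.Holap;   // Molap and Rolap are the other values
        partition.Update();
        partition.Process(ProcessType.ProcessFull);  // rebuild the partition under the new mode

        server.Disconnect();
    }
}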

Because proactive caching is enabled by default (although you can turn it off) whenever you change a partition’s storage setting away from the default, we’ll take a closer look at exactly what proactive caching is and how it can work in your BI project.

Proactive Caching
Wouldn’t it be terrific if your BI solution allowed end users to access data with all the query
response speed and flexibility of SSAS, yet also allowed them to use a solution that didn’t
require the typical latency (often one business day) between the OLTP source data and OLAP
data? That’s the concept behind proactive caching. Think of configuring proactive caching as
the method by which you manage the MOLAP cache.

What is the MOLAP cache? It’s an in-memory storage location created automatically by SSAS.
The cache includes actual data and, sometimes, aggregations. This information is placed in
the cache area after MDX queries are executed against the SSAS cubes. Figure 9-14 shows a
conceptual rendering of the MOLAP cache.

Figure 9-14 Proactive caching settings allow you to manage the update/rebuild frequency of the MOLAP cache.

Note One of the reasons queries to the SSAS store are so much faster than queries to RDBMS
query engines is that the former uses a MOLAP structure and cache. When you are consider-
ing configuring manual settings to manage cache refreshes, you can use the MSAS 2008:Cache
object in the Microsoft Windows Performance Monitor to measure which queries are being
answered by cache hits and which are being answered by disk calls.

Because every businessperson will tell you that it’s preferable to have minimal data latency,
why wouldn’t you use proactive caching in every BI solution? Proactive caching occurs in near
real time, but not exactly real time. And, importantly, the nearer you configure the MOLAP
cache refreshes to real time, the more overhead you add to both the SSAS and the OLTP
source systems. These considerations are why SSAS has six options for you to choose from
when configuring proactive caching using the Measure Group Storage Settings dialog box
(using the slider). In addition to the slider configuration tool, you have the possibility of still
more finely grained control using the Custom Options dialog box accessed by clicking the
Options button in the Measure Group Storage Settings dialog box. Or you can gain even
more control by manually configuring the individual partition property values by using the
Properties dialog box. Of course, if you want to configure different storage and caching set-
tings for a subset of a cube, you must first define multiple partitions for fact tables upon
which the cube is based.

Note Proactive caching is not for every BI solution. Using it effectively necessitates that you
either read your OLTP data directly as the source for your cube or read a replicated copy of your
data. Another option is to read your OLTP data using the new snapshot isolation level available in
SQL Server 2005 or later. To use any of these options, your data source must be very clean. If you
need to do cleansing, validation, or consolidation during extract, transform, and load (ETL) pro-
cessing, proactive caching is not the best choice for your solution.

Let’s start with a more complete explanation of the choices available in the Measure Group
Storage Settings dialog box (shown in Figure 9-13) as they relate to proactive caching set-
tings. The first choice you’ll make is whether to use MOLAP, HOLAP, or ROLAP data storage
for your cube. In most cases, because of the superior query performance, you’ll select some
version of MOLAP. The proactive caching configuration choices for MOLAP are as follows:

■■ Scheduled MOLAP When you select this setting, the MOLAP cache is updated
according to a schedule (whether the source data changes or not). The default is once
daily. This sets the rebuild interval to one day. This default setting is the one that we use
for the majority of our projects.
■■ Automatic MOLAP When you select this setting, the cache is updated whenever
the source data changes. It configures the silence interval to 10 seconds and sets a
10-minute silence override interval. We’ll say more about these advanced properties
shortly.
■■ Medium-Latency MOLAP With this setting, outdated caches are dropped periodically. (The default is a latency period of four hours.) The cache is updated when
data changes. (The defaults are a silence interval of 10 seconds and a 10-minute silence
override interval.)
■■ Low-Latency MOLAP With this setting, outdated caches are dropped periodi-
cally. (The default is a latency period of 30 minutes.) The cache is updated when data
changes. (The defaults are a silence interval of 10 seconds and a 10-minute silence
override interval.)

Tip To understand the silence interval property, ask the following question: “How long should
the cache wait to refresh itself if there are no changes to the source data?” To understand the
silence override interval property, ask the following question: “What is the maximum amount of
time after a notification (of source data being updated) is received that the cache should wait to
start rebuilding itself?”

If you select HOLAP or ROLAP, proactive caching settings are as follows:

■■ Real-Time HOLAP If you choose this setting, outdated caches are dropped immedi-
ately—that is, the latency period is configured as 0 (zero) seconds. The cache is updated
when data changes. (The defaults are a silence interval of 0 (zero) seconds and no
silence override interval.)
■■ Real-Time ROLAP With this setting, the cube is always in ROLAP mode, and all
updates to the source data are immediately reflected in the query results. The latency
period is set to 0 (zero) seconds.

As mentioned, if you’d like even finer-grained control over the proactive caching settings for
a partition, click the Options button in the Measure Group Storage Settings dialog box. You then can manually adjust the cache settings, options, and notification values. Figure 9-15
shows these options.

Figure 9-15 The Storage Options dialog box for proactive caching

Notification Settings for Proactive Caching


You can adjust the notification settings (regarding data changes in the base OLTP store) by
using the Notifications tab of the Storage Options dialog box. There are three types of notifi-
cations available in this dialog box:

■■ SQL Server If you use this option with a named query (or the partition uses a query to
get a slice), you need to specify tracking tables in the relational source database. If you
go directly to a source table, you use trace events. This last option also requires that the
service account for SSAS has dbo permissions on the SQL database that contains the
tracking table.
■■ Client Initiated Just as you do for SQL Server, you need to specify tracking tables in
the relational source database. This option is used when notification of changes will be
sent from a client application to SSAS.
■■ Scheduled Polling If you use this option, you need to specify the polling interval time
value and whether you want to enable incremental updates and add at least one poll-
ing query. Each polling query is also associated with a particular tracking table. Polling
queries allow more control over the cache update process.

Fine-Tuning Proactive Caching


Finally, here’s the most specific way to set proactive caching settings—use the Properties dia-
log box for a particular measure group, as shown in Figure 9-16.

Figure 9-16 Setting proactive caching properties through the Properties dialog box for a measure group

Here’s a summary of the settings available for proactive caching, followed by a brief AMO sketch that sets several of them:

■■ AggregationStorage You can choose either Regular or MOLAP Only (applies to parti-
tions only).
■■ Enabled You can choose either True or False (turns proactive caching on or off).
■■ ForceRebuildInterval This setting is a time value. It indicates the maximum amount of time to wait before rebuilding the cache, whether the source data has changed or not. The default is –1, which equals infinity.
■■ Latency This setting is a time value. It indicates the maximum amount of time an outdated cache can be used before it is dropped. The default is –1, which equals infinity.
■■ OnlineMode You can choose either Immediate or OnCacheComplete. This setting
indicates whether a new cache will be available immediately or only after it has been
completely rebuilt.
■■ SilenceInterval This setting is a time value. It indicates the maximum amount of time
for which the source data has no transactions before the cache is rebuilt. The default is
–1, which equals infinity.
■■ SilenceOverrideInterval This setting is a time value. It indicates the maximum amount
of time to wait after a data change notification in the source data to rebuild the cache
and override the SilenceInterval value. The default is –1, which equals infinity.
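The sketch below sets a few of these values from AMO. It is only an illustration under stated assumptions: the object names are placeholders, it assumes the AMO ProactiveCaching members mirror the property names listed above, and it assumes that notifications inherited from the partition’s own source binding (SQL Server trace events) are acceptable for your environment.

using System;
using Microsoft.AnalysisServices;

class TuneProactiveCaching
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");

        Partition partition = server.Databases.FindByName("Adventure Works DW 2008")
            .Cubes.FindByName("Adventure Works")
            .MeasureGroups.FindByName("Internet Sales")
            .Partitions.FindByName("Internet_Sales_Current");

        ProactiveCaching caching = new ProactiveCaching();
        caching.Enabled = true;
        caching.SilenceInterval = TimeSpan.FromSeconds(10);          // wait for a quiet source
        caching.SilenceOverrideInterval = TimeSpan.FromMinutes(10);  // but never wait longer than this
        caching.Latency = TimeSpan.FromMinutes(30);                  // drop outdated caches after 30 minutes
        caching.Source = new ProactiveCachingInheritedBinding();     // notifications from the partition's own binding

        partition.ProactiveCaching = caching;
        partition.Update();

        server.Disconnect();
    }
}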

Proactive caching is a powerful new capability that you might find invaluable in enhancing
the usability of your BI solution. As we mentioned earlier, the key consideration when decid-
ing whether to use proactive caching is the quality of your source data. It must be pristine
for this feature to be practical. In the real world, we’ve yet to find a client who has met this
important precondition.

We turn next to another approach to storage and processing, this time for dimensions rather than for partitions. This is called a ROLAP dimension.

ROLAP Dimensions
Recall that ROLAP partition-mode storage means that source data (fact table rows) is not
copied to the SSAS destination. Another characteristic of ROLAP partition storage is that
aggregations are written back to relational tables in the source schema. The primary reason
to use ROLAP partition storage is to avoid consuming lots of disk space to store seldom-
queried historical data. Queries to ROLAP partitions execute significantly more slowly
because the data is in relational, rather than multidimensional, format. With these consid-
erations noted, ROLAP dimensions are typically used in a couple of situations: when
using rapidly (nearly constantly) changing dimensional metadata, and when using huge
dimensions. Huge dimensions are those that contain millions or even billions of members.
An example is dimensions used by FedEx that track each customer worldwide for an indefi-
nite time period. The storage limits for SQL Server tables (that is, the maximum number of
rows) are still larger than those in SSAS dimensions.

An example of rapidly changing dimension data is a dimension that contains employee infor-
mation for a fast food restaurant chain. The restaurant chain might have a high employee
turnover rate, as is typical in the fast food industry. However, it might be a business require-
ment to be able to retrieve the most current employee name from the Employee dimension
at all times, and that there can be no latency. This type of requirement might lead you to
choose a ROLAP dimension.

Tip Despite the fact that you might have business situations that warrant using ROLAP dimen-
sions, we encourage you to test to make sure that your infrastructure (that is, hardware and soft-
ware) will provide adequate performance given the anticipated load. Although the performance
in SSAS 2008 has been improved from previous versions, some of our customers still find that it’s
too slow when using ROLAP dimensions for production cubes. If you’re considering this option,
be sure to test with a production level of data before you deploy this configuration into a pro-
duction environment.

Like so many advanced storage features, ROLAP dimensions require the Enterprise edition of
SSAS. Because you typically use this feature only for dimensions with millions of members,
the dimensional attribute values will not be copied to and stored on SSAS. Rather, they will
be retrieved directly from the relational source table or tables. To set a dimension as a ROLAP
dimension, open the Dimension editor in BIDS, and in the Properties window for that dimen-
sion change the StorageMode property from the default MOLAP to ROLAP.
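For completeness, the same property change can be scripted with AMO. Here is a minimal sketch, assuming a hypothetical Employee dimension and placeholder server and database names:

using Microsoft.AnalysisServices;

class MakeDimensionRolap
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");

        Dimension employee = server.Databases.FindByName("Adventure Works DW 2008")
                                   .Dimensions.FindByName("Employee");

        employee.StorageMode = DimensionStorageMode.Rolap;  // the default is Molap
        employee.Update();
        employee.Process(ProcessType.ProcessFull);          // a structural change requires a full process

        server.Disconnect();
    }
}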

As mentioned in the introduction to this section, although ROLAP dimensions increase the
flexibility of your cube, we’ve not seen them used frequently in production BI solutions. The reason is that any queries to the relational source will always be significantly slower than que-
ries to MOLAP data or metadata.

Linking
A couple of other configuration options and features you might consider as you’re preparing
your cubes for processing are linked objects and writeback. We’ll also review error-handling
settings (in the upcoming “Cube and Dimension Processing Options” section) because they
are important to configure according to your business requirements and their configuration
values affect cube processing times. Let’s start with linked objects.

Linked objects are SSAS objects—for example, measure groups or dimensions from a dif-
ferent SSAS database (Analysis Services 2008 or 2005)—that you want to associate with the
SSAS database you are currently working on. Linked objects can also include KPIs, actions,
and calculations. The linked objects option can be used to overcome the SSAS 2008 limit of
basing a cube on a single data source view. It also allows you a kind of scalability because you
can use multiple servers to supply data for queries.

The ability to use linked objects in SSAS is disabled by default. If you want to use this option,
you need to enable the property by connecting to SSAS in SSMS, right-clicking the SSAS
server instance, selecting Properties, and then enabling linking. The properties you need to
enable are Feature\LinkToOtherInstanceEnabled and Feature\LinkFromOtherInstanceEnabled.
After you’ve done that, you can use the Linked Object Wizard in BIDS, which you can access
by right-clicking on the Dimensions folder in BIDS Solution Explorer and then by clicking on
New Linked Dimension. You’ll next have to select the data source from which you want to
link objects. Then you’ll access the Select Objects page of the Linked Object Wizard, where
you’ll select which object from the linked database you want to include in the current cube.
If objects have duplicate names—that is, dimensions in the original SSAS database and the
linked instance have the same name—the linked object names will be altered to make them
unique (by adding an ordinal number, starting with 1, to the linked dimension name).

As with using many other advanced configuration options, you should have a solid busi-
ness reason to use linking because it adds complexity to your BI solution. Also, you should
test performance during the pilot phase with production levels of data to ensure that query
response times are within targeted ranges.

Writeback
Writeback is the ability to store “what if” changes to dimension or measure data in a change
table (for measures) or in an original source table (for dimensions). With writeback, the delta
(or change value) is stored, so if the value changes from an original value of 150 to a new
value of 200, the value 50 is stored in the writeback table. If you are using the Enterprise edition of SSAS 2008, you can enable writeback for a dimension or for a partition if certain
conditions are met.

To enable writeback for a dimension, you set the WriteEnabled property value of that dimen-
sion to True. You can also use the Add Business Intelligence Wizard to enable writeback for a
dimension. You cannot enable writeback for a subset of a dimension—that is, for individual
attributes. Writeback is an all-or-nothing option for a particular dimension.
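Scripted with AMO, the same change is a single property. The sketch below uses a hypothetical Scenario dimension and placeholder names:

using Microsoft.AnalysisServices;

class WriteEnableDimension
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");

        Dimension scenario = server.Databases.FindByName("Adventure Works DW 2008")
                                   .Dimensions.FindByName("Scenario");

        scenario.WriteEnabled = true;   // writeback is all-or-nothing for the dimension
        scenario.Update();
        scenario.Process(ProcessType.ProcessFull);

        server.Disconnect();
    }
}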

An important consideration with writeback dimensions is for you to verify that your selected
client applications support writeback. Also, you must confirm that allowing writeback is con-
sistent with your project’s business requirements. Another consideration is that end users who need to write to a dimension must be granted read/write permissions to that write-enabled dimension in their SSAS security roles.

In our experience, writeback is not commonly enabled for BI projects. One business case
where it might be worth using it is when a cube is used for financial forecasting, particularly
in “what if” scenarios.

Note Writeback is not supported for dimensions of the following types: Referenced (snowflake),
Fact (degenerate), Many-to-Many, or Linked. The dimension table must be a single table, directly
related to the fact. As mentioned, you can write-enable only an entire dimension; there is no
mechanism to write-enable only specific attributes of a dimension.

You can also enable writeback for measure group partitions whose measures all use the SUM aggregation function. An example of this from the Adventure Works sample cube is
the Sales Targets measure group, which includes one named partition called Sales_Quotas.
To enable writeback on a partition, navigate to the partition on the BIDS cube designer
Partitions tab, right-click on the partition, and then choose the Writeback Settings menu
option. Doing this opens the Enable Writeback dialog box, as shown in Figure 9-17. There you
configure the writeback table name, the data source, and the storage mode. Note that the
option to store writeback information in the more efficient and fast-to-query MOLAP mode
is new to SQL Server 2008. This results in significant performance improvement over the
previous version of SSAS, which allowed only ROLAP storage of measure group writeback-
enabled partitions.

As mentioned, if you intend to enable writeback for measure group partitions, you must
enable read/write access for the entire cube rather than for a particular measure group par-
tition in the SSAS security role interface. We recommend you verify that this condition is
in line with your project’s overall security requirements before you enable measure group
writeback.

Figure 9-17 The Enable Writeback dialog box allows you to configure writeback options for qualifying
measure group partitions.

Cube and Dimension Processing Options


Now that we’ve covered storage, aggregations, partitions, and caching, we’re (finally!) ready
to review cube and dimension processing option types. Dimensions must be completely
and correctly processed either prior to or at the beginning of a cube process. The best way
to understand this is to keep in mind that dimensional data is the metadata or the structure
of the cube itself. So the metadata must be available before the data can be loaded into
the cube.

During development, you will most often do a full process on your cube whenever you need
to view the results of a change that you’ve made. This option completely erases and rebuilds
all data and metadata. For some customers, this simple method of updating the cube can be
used in production as well. What is happening here, of course, is a complete overwrite on
rebuild. This is practical only for the smallest cubes—those that are a couple of GBs in size
at a maximum. Most of the time, you’ll choose to use the more granular processing options
after you move your cube to a production environment, which will result in shorter process-
ing times and more consistent cube availability.

The first aspect of processing that you’ll have to determine for production cubes is whether
you’ll choose to separate processing of dimensions from the cube. Our real-world experience
is that about 50 percent of the time we advise clients to process dimensions before process-
ing cubes. The choice to separate the process is usually due to dimension processing com-
plexity or size.

In this section, we’ll review the process for processing using BIDS. In future chapters, we’ll
discuss automating these cube and dimension refreshes using SSIS packages. We’ll start with
cube processing options. To process a cube in BIDS, right-click on the cube name in Solution
Explorer in BIDS and then select Process. You’ll see the dialog box shown in Figure 9-18.

Figure 9-18 To process a cube, right-click on the cube name in BIDS and then select the process type and
other options in the Process Cube dialog box.

You can also process cubes and dimensions from SSMS by using this same process. Following
is a more complete explanation of the available selections for process options for both cubes
and dimensions. Some options are available only for cubes or only for dimensions. We note
that in the following list:

■■ Default With this option, SSAS detects the current state of the cube or dimension,
and then does whatever type of processing (that is, full or incremental) that is needed
to return the cube or dimension to a completely processed state.
■■ Full With this option, SSAS completely reprocesses the cube or dimension. In the case
of a cube, this reprocessing includes all the objects contained within it—for example,
dimensions. Full processing is required when a structural change has been made to a
cube or dimension. An example of when Full processing is required for a dimension is
when an attribute hierarchy is added, deleted, or renamed. The cube is not available for
browsing during a Full processing.
■■ Data If you select this option, SSAS processes data only and does not build any aggre-
gations or indexes. SSAS indexes are not the same thing as relational indexes. They are
generated and used by SSAS internally during the aggregation process.
■■ Unprocess If you select this option, SSAS drops the data in the cube or dimension.
If there are any lower-level dependent objects—for example, dimensions in a cube—
those objects are dropped as well. This option is often used during the development
phase of a BI project to quickly clear out erroneous results.
■■ Index With this option, SSAS creates or rebuilds indexes for all processed partitions.
This option results in no operation on unprocessed objects.
■■ Structure (cubes only) With this option, SSAS processes the cubes and any contained
dimensions, but it does not process any mining models.
■■ Incremental (cubes only) With this option, SSAS adds newly available fact data and
processes only the affected partitions. This is the most common option used in day-to-
day production.
■■ Update (dimensions only) If you select this option, SSAS forces an update of dimen-
sion attribute values. Any new dimension members are added, and attributes of exist-
ing members are updated.

Note Aggregation processing behavior in dimensions depends on the AttributeRelationship
RelationshipType property. If this property is set to the default value (Flexible), aggregations are
dropped and re-created on an incremental process of the cube or update of the dimension. If
it is set to the optional (or nondefault) value (Rigid), aggregations are retained for cube/dimen-
sion incremental updates. Also, if you set the dimension ProcessingMode for a dimension to
LazyAggregations, flexible aggregations are reprocessed as a background task and end users can
browse the cube while this processing is occurring.

An optimization step you can take to reduce processing times for your dimensions is to turn
off the AttributeHierarchyOptimizedState property for dimensional attributes that are only
viewed infrequently by end users.

Tip To identify infrequently queried attributes, you can either use a capture of queries from SQL
Server Profiler or read the content of the LogTable after running the Query Optimization Wizard.

To adjust the AttributeHierarchyOptimizedState property, open the Properties dialog box for
the particular dimension attribute and then set that property value to NotOptimized. Setting
the value to NotOptimized causes SSAS to not create supplementary indexes (such as those
that are created by default) for this particular attribute during dimension or cube process-
ing. This can result in slower query times, so change this setting only for rarely browsed
attributes.

The final consideration when processing cubes and dimensions is whether you'll need
to adjust any of the processing options. You access these options by clicking the Change
Settings button in the Process Cube dialog box. Clicking this button displays the Change
Settings dialog box, which is shown in Figure 9-19. This dialog box contains two tabs:
Processing Options and Dimension Key Errors.

Figure 9-19 The Dimension Key Errors tab in the Change Settings dialog box allows you to specify custom
error behavior responses when processing a cube.

On the Processing Options tab, you can set the following values:

■■ Parallel Processing or Sequential Processing (if parallel, maximum number of parallel tasks must be specified)
■■ Single or multiple transactions (for sequential processing)
■■ Writeback Table (either Use Existing, Create, or Create Always)
■■ Process Affected Objects (either Off or On)

The Dimension Key Errors tab, shown in Figure 9-19, allows you to configure the behavior
of errors during processing. By reviewing this tab, you can see that you can either use the
default error configuration or set a custom error configuration. When using a custom error
configuration, you can specify actions to take based on the following settings:

■■ The Key Error Action drop-down list enables you to specify a response to key errors.
Figure 9-19 shows the Convert To Unknown option selected in the drop-down list.
■■ The Processing Error Limit section enables you to specify a limit for the number of
errors that are allowed during processing. Figure 9-19 shows the Number Of Errors
option set to 0 (zero) and the On Error Action drop-down list with the Stop Processing
item selected. This configuration stops processing on the first error.
■■ The Specific Error Conditions section includes the Key Not Found, Duplicate Key, Null
Key Converted To Unknown, and Null Key Not Allowed options.
■■ The Error Log Path text box allows you to specify the path for logging errors.

Although you have probably been processing test cubes for a while prior to reading this chapter,
you should now have a bit more insight into what actually happens when you run the
process action. As we've seen, when you execute the Process command on a cube or dimen-
sion, the step-by-step output of processing is shown in the Process Progress dialog box. In
production, you usually automate the cube/dimension processing via SSIS packages, using
the cube or dimension processing tasks that are included as a part of the SSIS control flow
tasks.

As we mentioned previously in this chapter, you can choose to process one or more dimen-
sions rather than processing an entire OLAP cube. Cube processing automatically includes
associated dimension processing. In fact, processing a cube will execute in the following
order: dimension (or dimensions) processing and then cube processing. It should be clear by
this point why the processing is done in this order. The metadata of the cube includes the
dimension source data. For example, for a Customer dimension, the source rows for each cus-
tomer in the customer table create the structure of the cube. In other words, the number of
(in our example, customer) source rows loaded during dimension processing determines the
size of the cube container. You can visualize this as one of the sides of the cube—that is, the
more source (customer) rows there are, the longer the length of the particular side will be. Of
course, this is not a perfect analogy because cubes are n-dimensional, and most people we
know can’t visualize anything larger than a four-dimensional cube.

If you’re wondering how to actually visualize a four-dimensional cube, think of a three-


dimensional cube moving through time and space.

Because the source rows for a dimension are metadata for an OLAP cube, the cube dimen-
sions must be successfully processed prior to loading cube data (which is data from any
underlying fact tables). When you select a particular type of cube processing—that is, Full,
Incremental, Update, and so on—both the related dimensions and then the cube are pro-
cessed using that method. We find that complete cube processing has been the most com-
mon real-world scenario, so we focused on completely explaining all the options available in
that approach in this chapter. As we mentioned previously, we have occasionally encountered
more granular processing requirements related to one or more dimensions. SSAS does sup-
port individual dimension processing using the processing options that we described in the
list earlier in this section.

Summary
Although there are myriad storage and processing options, we find that for most projects the
default storage method of MOLAP works just fine. However, partitions do not create them-
selves automatically, nor do aggregations. Intelligent application of both of these features
can make your cubes much faster to query, while still keeping processing times to an accept-
ably fast level. If you choose to implement any of the advanced features, such as proactive
caching or ROLAP dimensions, be sure you test both query response times and cube process-
ing times during development with production-load levels of data.

In the next two chapters, we'll look at the world of data mining. After that, we've got a
bit of material on the ETL process using SSIS to share with you. Then we'll explore the
world of client tools, which include not only SSRS, but also Office SharePoint Server 2007,
PerformancePoint Server, and more.
Chapter 10
Introduction to MDX
In this chapter and the next one, we turn our attention to Multidimensional Expressions
(MDX) programming. MDX is the query language for OLAP data warehouses (cubes).
Generally speaking, MDX is to OLAP databases as Transact-SQL is to Microsoft SQL Server
relational databases. OLAP applications use MDX to retrieve data from OLAP cubes and to
create stored and reusable calculations or result sets. MDX queries usually comprise several
items:

■■ Language statements (for example, SELECT, FROM, and WHERE)
■■ OLAP dimensions (a geography, product, or date hierarchy)
■■ Measures (for example, dollar sales or cost of goods)
■■ Other MDX functions (Sum, Avg, Filter, Rank, and ParallelPeriod)
■■ Sets (ordered collections of members)

We take a closer look at all of these items, mostly by examining progressively more complex
MDX queries and statements. Although MDX initially appears similar to Transact-SQL, there
are a number of significant differences between the two query languages. This chapter cov-
ers the fundamentals of MDX and provides brief code samples for most of the popular MDX
language features. The next chapter provides some richer examples of how you can leverage
MDX in your business intelligence (BI) solution. In this chapter, we first take a closer look at
core MDX syntax and then discuss several commonly used MDX functions. Unlike the rest of
this book, this chapter and the next one focus on language syntax. We feel that you’ll learn
best by being able to try out successively more complex MDX queries.

As with Transact-SQL, MDX query processing relies not only on you writing the most effi-
cient data access query statements, but also on that code making efficient use of the internal
query processing mechanisms. Although the focus of this chapter is on understanding and
writing efficient code, we also introduce some core concepts related to query processing
architecture in SSAS here.

The Importance of MDX


So far in this book, we haven’t emphasized MDX syntax. Although we have implemented
some BI solutions that included only minimal manual MDX coding, we find that understand-
ing the MDX language is important for building a successful BI project. One reason for this
is because tools, such as SQL Server Management Studio (SSMS) and Business Intelligence
Development Studio (BIDS), that expose OLAP cubes for reporting/dashboard purposes


contain visual designers to create output. Most of these designers (for example, the design-
ers in SSRS and PerformancePoint Server) generate MDX code. Why then is it still important,
even critical sometimes, for you to know MDX?

Although most tools do an adequate job of allowing users to produce simple output from
OLAP cubes, these tools aren’t designed to be able to automatically generate MDX queries
for every possible combination of options that might be required for certain reports that are
part of your project’s business requirements. Here are some examples of real-world problems
for which you might need to write manual MDX queries:

■■ To create a certain complex sort order for output
■■ To rank items in a report by some criteria
■■ To display hierarchical data in a way that the design tool doesn’t fully support
■■ To create some trend-based measures in a report, such as dollars sold for the same
time period of the previous year or dollar sales to date
■■ To create KPIs with complex logic that compares the KPI value of one period to the KPI
value of another period

Even when OLAP reporting tools provide designers that automatically generate MDX code,
you might still find yourself needing to modify or rewrite the code. Even if you need to add
only one line of code or just tweak a generated MDX expression, you still need to know how
and where to do it. We believe that successful OLAP developers need to learn both the fun-
damentals of MDX as well as the more advanced features of the language.

In this chapter, we focus on MDX queries that produce a result set. When you’re working with
OLAP cubes, you’ll use both MDX queries and MDX expressions. To understand the differ-
ence between an MDX query and an MDX expression, you can think of a Microsoft Office
Excel spreadsheet. An OLAP cube MDX expression produces a calculated cell as a result. In
this way, it’s similar to an Excel cell formula—it calculates new values and adds them to the
cube output that is displayed. The difference is that these calculations are automatically
applied to multiple cells in the cube as specified in the scope of the MDX expression.

An OLAP cube MDX query produces a result set, which is called a cellset. This cellset is nor-
mally displayed in a matrix-type output. This is similar to applying a filter to an Excel work-
book: the filter produces new output, which is some reduced subset of the original data.

Note We find that when developers with a Transact-SQL background initially see MDX syntax,
they conclude that MDX will be similar to different dialects of SQL. Although that conclusion is
understandable, it's counterproductive. Developers who can put aside comparisons between
Transact-SQL and MDX learn MDX more quickly, because there isn't always a direct Transact-SQL
equivalent for each MDX concept.

Writing Your First MDX Queries


We’ll begin by entering queries directly into the query editor window in SSMS. To do this,
start SSMS, and connect to your SQL Server Analysis Services (SSAS) instance and Adventure
Works DW sample. Next, select the OLAP cube from the SSMS Object Explorer, and then click
the New Query button to open a new MDX query editor window. The principal use of new
queries is twofold: to generate particular types of reports or as the result of directly manipu-
lating cube data in a client interface, such as a pivot table that supports direct query genera-
tion. A basic MDX query contains a SELECT statement, a definition for the COLUMNS axis, a
definition for the ROWS axis, and the source cube. So a simple but meaningful query we can
write against Adventure Works is shown here as well as in Figure 10-1:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Customer].[State-Province] ON ROWS
FROM [Adventure Works]

Figure 10-1 Simple MDX query

You must specify the COLUMNS axis value first and the ROWS axis value second in an MDX
query. You can alternatively use the axis position—that is, COLUMNS is axis(0) and ROWS is
axis(1)—in your query. It’s helpful to understand that you can return a single member or a set
of members on each axis. In this case, we've written a simple query—one that returns a single
member on both axes. In the FROM clause, you’ll reference a cube name or the name of what
is called a dimension cube. The latter name type is prefaced with a dollar sign ($). We’ll talk
more about this in a later section.
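
For example, the following variation of the query in Figure 10-1 (our own sketch, not one of the
chapter's numbered samples) uses axis numbers in place of the COLUMNS and ROWS keywords
and returns exactly the same cellset:

SELECT [Measures].[Internet Sales Amount] ON AXIS(0),
[Customer].[State-Province] ON AXIS(1)
FROM [Adventure Works]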

Tip It is common in MDX query writing to capitalize MDX statement keywords, and to use camel
casing (that is, PeriodsToDate) for MDX functions. This is a best practice and makes your code
more readable for other developers. MDX itself (other than when performing string matches) is
not case sensitive. MDX is also not space sensitive (other than for string matches), so it’s common
to place keywords on a new line, again for readability.

MDX Object Names


Individual object names are surrounded by brackets. The syntax rules require square brackets
around object names in only the following circumstances:

■■ Object names contain embedded spaces.
■■ Object names are the same as MDX keywords.
■■ Object names begin with a number rather than a letter.
■■ Object names contain embedded symbols.

Also, object names that are member names can be either explicitly named—for example,
[Customer].[State-Province].California, CA—or referenced by the key value—for example,
[Customer].[State-Province].&[CA]&[US]. The ampersand (&) is used with key values to identify
the dimension member. The key can be a multipart key, as in the preceding example (where
CA is one value in the key and US is the other). A query will always run if you choose to
include all object names in square brackets, and we consider this to be a best syntax practice.
Note that dimensions and their related objects (that is, members, levels, and hierarchies) are
separated by a period.

Object names are also called tuples. Here is the definition of a tuple from SQL Server Books
Online:

A tuple uniquely identifies a cell, based on a combination of attribute members
that consist of an attribute from every attribute hierarchy in the cube. You do not
need to explicitly include the attribute member from every attribute hierarchy. If a
member from an attribute hierarchy is not explicitly listed, then the default member
for that attribute hierarchy is the attribute member implicitly included in the tuple.

If your query consists of a single tuple, the delimiter used to designate a tuple—paren-
theses—is optional. However, if a tuple contains members from more than one
dimension, you must separate each member by a comma and enclose the group of members
in parentheses. The order of the members in the tuple does not matter; you're uniquely
identifying a cell in the cube simply by listing the values and then enclosing your list in
parentheses.
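
To see these rules in action, here is a small sketch of ours (not one of the chapter's numbered
examples). Every object name is enclosed in square brackets, and because the tuple on the
COLUMNS axis combines members from two different dimensions (Measures and Date), the
members are separated by a comma and the whole tuple is wrapped in parentheses:

SELECT ( [Measures].[Internet Sales Amount],
[Date].[Fiscal].[FY 2004] ) ON COLUMNS,
[Customer].[State-Province].Members ON ROWS
FROM [Adventure Works]

The effect of the tuple is that the Internet Sales Amount values returned for each state on the
ROWS axis are limited to fiscal year 2004.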

Other Elements of MDX Syntax


Here are a few other basic things to know about MDX:

■■ Single-line code comments are created with the double forward slash (//) or double
hyphen (--) delimiters, just as they are in Transact-SQL.
■■ Multiline comments repeat this syntax for each line of the comment: /* line 1 */ . . .
/* line 2 */ . . . /* line n */
■■ Operators—for example, the plus sign (+), minus sign (–), forward slash (/), and others—
work the same way that they do in Transact-SQL. However, the asterisk (*) can have a
special meaning, which will be discussed later in this chapter. Operator precedence is
also the same in MDX as it is in Transact-SQL. Using angled brackets (< >) means “not
equal to” in MDX.
■■ MDX contains reserved keywords. You should avoid naming objects using these words.
If used in a query, they must be delimited with square brackets. See the SQL Server
Books Online topic “MDX Reserved Words” for a complete list.
■■ MDX contains functions. These functions perform set-creation operations, hierarchy
navigation, numeric or time-based calculations, and more. We’ll take a closer look at
many functions throughout these two chapters.

If you make a syntax error in your MDX query, the query tool in SSMS shows you a red squig-
gly line under the code in error. Executing the query with an error in it will result in some
kind of information about the error being displayed in the Messages pane. Query error infor-
mation is much less detailed than you’re used to if you have experience with Transact-SQL.

The query shown earlier in Figure 10-1 generates an aggregated, one-column result of a
little over $29 million. Although that result might be helpful if you simply want to know the
total of Internet sales for the entire OLAP cube, you’ll usually need to “slice and dice” the
Adventure Works cube by using different conditions. This chapter demonstrates how to
break down the cube by different criteria.

First, let’s produce a list of Internet sales by state. (State is one of the Adventure Works OLAP
Customer dimension levels.) Our query, shown in the following code sample, lists the states
on the ROWS axis:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Customer].[State-Province].Members ON ROWS
FROM [Adventure Works]

You can type this query into a query window (such as the one in SSMS), or you can use
the designer to drag and drop measures and dimension elements from the cube list into
the query window. For better productivity, we recommend that you drag and drop object
names from the metadata listing into the position in the query text where you want to insert
them. Object names can become quite long and cumbersome to type, particularly when
you’re including objects that are part of a deep hierarchy, such as Customer->Customer
Geography->Country->State->City, and so on. In Figures 10-2 and 10-3, we show the meta-
data available for the Adventure Works sample cube.

In Figure 10-2, we expanded the Measures metadata for the Adventure Works sample cube,
showing the contents of the Internet Sales measure group. This group includes two types of
measures: stored measures and calculated members. Stored measures are indicated by a bar
chart icon. The first stored measure shown is Internet Extended Amount. Calculated members
are indicated by a bar chart plus a calculator icon. The first calculated member shown is
Growth In Customer Base. From an MDX query perspective, both types of measures are que-
ried in an equivalent way.

Figure 10-2 Available measures from the Adventure Works cube

Tip Queries to stored measures return results more quickly than those to calculated members.
We recommend that you test queries to calculated members under production-load levels dur-
ing prototyping. Based on the results, you might want to convert some calculated members to
stored measures. You will have this option available only with some types of calculated members.
For example, ratios typically have to be calculated dynamically to return accurate results. Of
course, converting calculated members to stored measures increases the cube storage space and
processing times, so you must base your design on business requirements.

In Figure 10-3, we’ve opened the Metadata tab to expose the items contained in a particular
dimension—in this case, Customer. These items include all of the attribute hierarchy display
folders, dimension members, dimension levels, and dimension hierarchies. Recall that all
dimensions have members and at least one level. It’s optional for dimensions to have hier-
archies and display folders. If a dimension has a defined hierarchy, to query objects in that
dimension, you must specify the particular hierarchy as part of the object name. Hierarchy
display folders are for display only; they are not used in MDX queries.

Figure 10-3 Available dimensions from the Adventure Works cube

The number of dots next to the level names—in our example, Country, State-Province, and
so on—indicates the position in the hierarchy, with smaller numbers indicating a higher posi-
tion in the hierarchy. The grouping of dots in the shape of a pyramid—in our case, next to
Customer Geography—indicates that the object is a dimensional hierarchy.

MDX Core Functions


Figure 10-4 shows the results after we changed the query to return a set of members on the
ROWS axis. We did this by specifically listing the name of a dimension level in the MDX query
(State-Province) and by using the MDX Members function. The Members function returns a
list of all members in the State dimension level, plus an All Customers total. The All member
is included in every dimension by default unless it has been specifically disabled, or hidden.
The All member is also the default returned member when none is specified. The default
member can be changed in the design of a dimension or by associating specific default
members with particular security groups—that is, members of the WestRegion security group
will see the West Region as the default returned member, members of EastRegion security
group will see the East Region as the default returned member, and so on.

Figure 10-4 Result set for all members of the Customer/State dimension

If we simply wanted the states, without a total row (or the All member), we could change
Members to Children as shown here:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Customer].[State-Province].Children ON ROWS
FROM [Adventure Works]

Note that some states or provinces contain null values, indicating there is no value for that
cell. You can filter out the null values with a NON EMPTY keyword as shown in the following
code. The query results are shown in Figure 10-5. Adding the NON EMPTY keyword makes
your queries more efficient because the results generated are more compact.

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
NON EMPTY [Customer].[State-Province].Children ON ROWS
FROM [Adventure Works]

Figure 10-5 Results of querying with NON EMPTY



Of course, you’ll sometimes want more than one column to appear in a result set. If you want
to include multiple columns, place them inside curly braces ({ }). Using curly braces explicitly
creates a set result in MDX terminology. A set consists of one or more tuples from the same
dimension. You can create sets explicitly, by listing the tuples, as we’ve done here, or you can
use an MDX function to generate a set in a query. Explicit set creation is shown in the follow-
ing code, and the query result is shown in Figure 10-6:

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY [Customer].[State-Province].Children ON ROWS
FROM [Adventure Works]

Figure 10-6 The results of retrieving multiple columns with a NON EMPTY query

You might be wondering, “But what if I want to break out dollar sales and gross profit by
the Product dimension’s Category level as well as the Customer dimension’s State level?”
You simply join the two dimensions together (creating a tuple with multiple members) with
parentheses on the ROWS axis of the MDX query as shown in the following code. Unlike cre-
ating a set, by using curly braces as you just did on the COLUMNS axis, here you’re simply
asking the MDX query processor to return more than one set of members on the ROWS axis
by using the comma delimiter and the parentheses to group the sets of members. The query
results are shown in Figure 10-7.

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY ( [Customer].[State-Province].Children,
[Product].[Category].Children ) ON ROWS
FROM [Adventure Works]

Figure 10-7 Results of creating a tuple

Alternatively, you could use the asterisk (*) symbol to join the two dimensions. The updated
query is shown here, and the query results are in Figure 10-8:

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY [Customer].[State-Province].Children *
[Product].[Category].Children ON ROWS
FROM [Adventure Works]

Figure 10-8 Result set with two columns and two dimension sets

Up to this point, the result sets have appeared in the order specified in the cube—that is, the
dimensional attribute order configured in the original metadata. You can sort the results
with the Order function. The Order function takes three parameters: the set of members
you want to display, the measure you’re sorting on, and the sort order itself. So if you want to
sort Customer State-Province on the Internet Sales Amount in descending order, you’d write
the following query, which produces the output shown in Figure 10-9:

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY Order(
[Customer].[State-Province].Children,
[Measures].[Internet Sales Amount],DESC) ON ROWS
FROM [Adventure Works]

Figure 10-9 Sorted results can be obtained by using the Order function.

If you want to include an additional value on the ROWS axis, such as Product Category, you
would write the following query. The results are shown in Figure 10-10.

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY Order(
[Product].[Category].Children *
[Customer].[State-Province].Children ,
[Measures].[Internet Sales Amount],DESC)
ON ROWS
FROM [Adventure Works]

Figure 10-10 Sorting can be done on multiple dimension sets.



However, when you generate the result set with this query, you'll see the State-Province members
sorted by Internet Sales Amount, but only within each Product Category, which remains in
dimension order. This is the equivalent of the Transact-SQL construction of ORDER BY Product
Category, Sales Amount DESC. If you want to sort from high to low on every Product
Category/Customer State-Province combination, regardless of the dimension order, you use
the BDESC keyword instead of the DESC keyword. The BDESC keyword effectively breaks any
dimension or hierarchy definition and sorts purely based on the measure.
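
For example, changing DESC to BDESC in the preceding query (a one-keyword edit on our part)
sorts every Product Category and Customer State-Province row strictly by Internet Sales Amount,
ignoring the Category grouping:

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY Order(
[Product].[Category].Children *
[Customer].[State-Province].Children ,
[Measures].[Internet Sales Amount],BDESC)
ON ROWS
FROM [Adventure Works]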

You can retrieve specific dimension members by including them in braces, which you’ll recall
from earlier discussion in this chapter, explicitly creates a set. For instance, if you want to
retrieve the sales amount for Caps, Cleaners, Fenders, and Gloves, you write the following
query:

SELECT {[Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] }
ON COLUMNS,
{ [Product].[SubCategory].[Caps] ,
[Product].[SubCategory].[Cleaners] ,
[Product].[SubCategory].[Fenders] ,
[Product].[SubCategory].[Gloves] }
ON ROWS
FROM [Adventure Works]

The results are shown in Figure 10-11.

Figure 10-11 Retrieving specific dimension members

If Caps, Cleaners, Fenders, and Gloves are consecutively defined in the dimension level, you
can use the colon (:) symbol to retrieve all the members in between the two expressions just
as you use it in Excel to retrieve all members of a list, as shown here:

SELECT {[Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] }
ON COLUMNS,
{ [Product].[SubCategory].[Caps] :
[Product].[SubCategory].[Gloves] }
ON ROWS
FROM [Adventure Works]

The : symbol is used mostly for date ranges. Here is an example that displays dates from a
date range. The results are shown in Figure 10-12.

SELECT {[Date].[Fiscal].[February 2002] : [Date].[Fiscal].[May 2002]} *
{ [Measures].[Internet Sales Amount],
[Measures].[Internet Tax Amount]} ON COLUMNS,
NON EMPTY [Product].[SubCategory].CHILDREN ON ROWS
FROM [Adventure Works]

Figure 10-12 Results of using the colon to include a date range

Finally, you can place a WHERE clause at the end of a query to further “slice” a result set. In
the following example, you won’t see references to “2005” or “Marketing” in the result set;
however, the aggregations for sales amount and gross profit will include only data from fis-
cal year 2005 and where the Sales Reason includes Marketing. The query results are shown
in Figure 10-13. Although the WHERE clause acts like a kind of a filter for the result set, MDX
has a separate Filter function that performs a different type of filtering. We’ll cover that in the
next section.

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY Order(
[Product].[Category].Children *
[Customer].[State-Province].Children ,
[Measures].[Internet Sales Amount],BDESC)
ON ROWS
FROM [Adventure Works]
WHERE ([Date].[Fiscal Year].[FY 2005], [Sales Reason].[Marketing])

Figure 10-13 Results of adding a WHERE clause



Filtering MDX Result Sets


MDX provides many different options for filtering result sets. The specific filter option you’ll
use will be based on the type of filter you want to execute. Once again, seeing specific exam-
ples can go a long way toward understanding the language. For example, if you want to filter
on only product subcategories with a total Internet Gross Profit of at least $1 million and
Internet sales of at least $10 million, you can use the Filter function. In the following example,
you simply wrap the values on ROWS inside a Filter function and then specify a filter expres-
sion. You can connect filter expressions with AND/OR operands, as shown here:

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
Filter([Product].[SubCategory].Children ,
[Measures].[Internet Gross Profit] > 1000000 AND
[Measures].[Internet Sales Amount] > 10000000)
ON ROWS
FROM [Adventure Works]

The results are shown in Figure 10-14.

Figure 10-14 Results of using the Filter function

To help you understand the difference between using the WHERE keyword and the Filter
function, we’ll combine both in a query. You can combine a Filter statement with a WHERE
clause, as shown here. The results are shown in Figure 10-15.

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
Filter([Product].[SubCategory].Children ,
[Measures].[Internet Gross Profit] > 1000 AND
[Measures].[Internet Sales Amount] > 10000) ON ROWS
FROM [Adventure Works]
WHERE ( [Customer].[Customer Geography].[Country].[Canada],
[Promotion].[Promotion Type].[New Product] )

Figure 10-15 Results of using the WHERE keyword and the Filter function

At this point, you might be wondering, “In what situations should I use WHERE, and when
should I use Filter?” Good question! The answer is that you should use Filter when you’re
filtering results against a measure (for example, dollar sales greater than a certain amount,
average price less than a certain amount, and so on). If you want to filter on specific
dimension members, you’re best off placing them in a WHERE statement. The one exception
to this is if you want to filter on some substring of a dimension member.

For example, suppose you want to filter on product subcategories that have the word BIKE in
the description. You can use the InStr function inside a filter—and you must also drill down
to the CurrentMember.Name function of each dimension member, as shown in the follow-
ing code. The CurrentMember function returns information about the selected member. The
returned object contains a couple of properties, such as Name and Value. These properties
allow you to specify exactly what type of information you want to return about the currently
selected member from the cube. The results of this query are shown in Figure 10-16.

SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Gross Profit] } ON COLUMNS,
Filter([Product].[SubCategory].Children ,
[Measures].[Internet Gross Profit] > 1000 AND
[Measures].[Internet Sales Amount] > 10000 AND
InStr([Product].[SubCategory].CurrentMember.Name, 'BIKE') > 0)
ON ROWS
FROM [Adventure Works]

Figure 10-16 Results of using the InStr function

Calculated Members and Named Sets


Although OLAP cubes contain summarized and synthesized subsets of your raw corporate
data, your business requirements might dictate that you add more aggregations. These are
called calculated members because you most often create them by writing MDX expressions
that reference existing members in the measures dimension.

For instance, if you want to create a calculated member on Profit Per Unit (as Internet Gross
Profit divided by Internet Order Quantity) and use that as the sort definition in a subsequent
MDX query, you can use the WITH MEMBER statement to create a new calculation on the fly.
For instance, the following query creates a calculated member that not only appears in the
COLUMNS axis, but is also used for the Order function. The query results are shown in Figure
10-17. Note also that you do not need to separate multiple calculated members with com-
mas. Also in this code example, we introduce the FORMAT_STRING cell property. This allows
you to apply predefined format types to cell values returned from a query. In our case, we
want to return results formatted as currency.

WITH MEMBER [Measures].[Profit Per Unit] AS
[Measures].[Internet Gross Profit] / [Measures].[Internet Order Quantity],
FORMAT_STRING = 'Currency'

MEMBER [Measures].[Profit Currency] AS
[Measures].[Internet Gross Profit], FORMAT_STRING = 'Currency'

SELECT { [Measures].[Internet Order Quantity],
[Measures].[Profit Currency],
[Measures].[Profit Per Unit] } ON COLUMNS,
NON EMPTY Order(
[Product].[Product].Children ,
[Measures].[Profit Per Unit],DESC) ON ROWS
FROM [Adventure Works]

Figure 10-17 WITH MEMBER results

In addition to creating calculated members, you might also want to create named sets. As
we learned in Chapter 8, “Refining Cubes and Dimensions,” named sets are simply aliases for
groups of dimension members. You can create named sets in an OLAP cube using the BIDS
interface for OLAP cubes—specifically, the Calculations tab. We look at the MDX code used
to programmatically create named sets in the next query. There are some enhancements to
named sets in SQL Server 2008 that we’ll cover in more detail in the next chapter. The core
syntax is similar to that used to create calculated members—that is, CREATE SET…AS or WITH
SET…AS—to create a session or query specific named set.

In the following code, we enhance the preceding query by adding the Filter function to the
second calculated member and placing it in a named set. So, in addition to ordering the
results (which we do by using the Order function), we also filter the values in the calculated
set in this query.

In MDX, the SET keyword allows you to define and to create a named subset of data from the
source. SET differs from the Members function in that using the latter restricts you to return-
ing one or more values from a single dimension, whereas the former allows you to return
values from one or more dimensions. We’ll use the SET keyword in this query to define a set
of products ordered by Profit Per Unit and filtered to include only products where the Profit
Per Unit is less than 100.

The query results are shown in Figure 10-18.

WITH MEMBER [Measures].[Profit Per Unit] AS
[Measures].[Internet Gross Profit] / [Measures].[Internet Order Quantity],
FORMAT_STRING = 'Currency'

MEMBER [Measures].[Profit Currency] AS
[Measures].[Internet Gross Profit], FORMAT_STRING = 'Currency'

SET [OrderedFilteredProducts] AS
Filter(
Order(
[Product].[Product].Children ,
[Measures].[Profit Per Unit],DESC),
[Measures].[Profit Per Unit] < 100)

SELECT { [Measures].[Internet Order Quantity],
[Measures].[Profit Currency],
[Measures].[Profit Per Unit] } ON COLUMNS,
NON EMPTY [OrderedFilteredProducts]
ON ROWS
FROM [Adventure Works]

Figure 10-18 Results of the enhanced WITH MEMBER query

Creating Objects by Using Scripts


There are several places in MDX where you can use a new object created by a script. These
objects can be persistent or temporary. The general rule is that using WITH creates a tempo-
rary object—such as a temporary calculated member or named set—whereas CREATE cre-
ates a persistent object—such as a calculated member or named set. You should choose the
appropriate type of object creation based on the need for reuse of the new object—WITH
creates objects in the context of a specific query only, whereas CREATE creates objects for the
duration of a user's session. In the next chapter, we take a closer look at the syntax for and use of
persistent objects.
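
As a brief preview, here is a sketch of the session-scoped form (our own example; the set name
is made up). Executing the CREATE SET statement by itself in an SSMS MDX query window creates
a named set that lives for the rest of your session, and any later query in that session can
reference it just as the WITH SET examples later in this chapter do:

// Run this statement on its own; the named set persists for the session
CREATE SET [Adventure Works].[NonZeroStates] AS
Filter( [Customer].[State-Province].Children,
[Measures].[Internet Sales Amount] > 0 )

// Any subsequent query in the same session can then reference the set
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[NonZeroStates] ON ROWS
FROM [Adventure Works]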

The TopCount Function


The next group of MDX functions we’ll examine are straightforward. They include TopCount,
which is useful, for example, if you want to retrieve the top 10 products by Internet Sales
Amount. The TopCount function takes three parameters: the data you want to retrieve and
display, the number of members, and the measure to be used for the sort criteria. So, in
essence, a TopCount value of 10 on Internet Sales Amount is saying, “Give me the 10 products
that sold the most, in descending order.”

TopCount (and many other MDX functions) works best with named sets. Like calculated mem-
bers, named sets allow you to abstract the definition of what is to be displayed down the left
column (that is, row) of a query result set (or across the top—columns—if need be). In this
example, we’re creating a named set that retrieves the top 10 best-selling products, and then
we’re using that named set in the ROWS axis. We use named sets to illustrate this concept.
The results are shown in Figure 10-19.

WITH SET [Top10Products] AS
TopCount([Product].[Product].Children,
10,
[Measures].[Internet Sales Amount])

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Top10Products] ON ROWS
FROM [Adventure Works]

Figure 10-19 Results of using TopCount

As we saw previously, where we sorted on multiple columns, you can use TopCount to
retrieve the top 10 combinations of products and cities, as shown here. The results are shown
in Figure 10-20.

WITH SET [Top10Products] AS
TopCount( [Product].[Product].Children * [Customer].[City].Children ,
10,
[Measures].[Internet Sales Amount])

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Top10Products] ON ROWS
FROM [Adventure Works]

Figure 10-20 Multiple results on the ROWS axis using TopCount

Not surprisingly, there is a BottomCount statement, which allows you to retrieve the lowest
combination of results. Additionally, there are TopPercent and TopSum statements (and their
BottomPercent and BottomSum counterparts). TopPercent allows you to (for example) retrieve the products that
represent the top 10 percent (or 15 percent or whatever you specify) of sales. TopSum allows
you to (for example) retrieve the highest selling products that represent the first $1 million of
sales (or some other criteria).
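
To give you a feel for the syntax, here is a sketch of ours (not a numbered sample from the
chapter). TopPercent and TopSum take the same three arguments as TopCount, except that the
second argument is a percentage of the total for TopPercent and a target running total for TopSum:

// Products that together account for the top 10 percent of Internet sales
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
TopPercent( [Product].[Product].Children,
10,
[Measures].[Internet Sales Amount] ) ON ROWS
FROM [Adventure Works]

// For TopSum, replace the TopPercent call with something like:
// TopSum( [Product].[Product].Children, 1000000, [Measures].[Internet Sales Amount] )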

This next query is a bit more complex. Suppose we want the top five states by sales, and then
for each state, the top five best-selling products underneath. We can use the MDX Generate
function to perform the outer query (for the states) and the inner query for the products, and
then use State.CurrentMember to join the two. The Generate function produces a new set as a
result based on the arguments that you specify for it. Generate derives the new set by apply-
ing the set defined in the second argument to each member of the set defined in the first
argument. It returns this joined set, and eliminates duplicate members by default. Here is an
example of this type of query:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
Generate (
TopCount (
[Customer].[State-Province].Children,5 ,
[Internet Sales Amount]),
({[Customer].[State-Province].CurrentMember},
Topcount([Product].[SubCategory].Children,5,
[Internet Sales Amount] )),ALL ) ON ROWS
FROM [Adventure Works]

The results are shown in Figure 10-21.



Figure 10-21 Results of using Generate with TopCount

Rank Function and Combinations


Not only can we use TopCount to retrieve the top n members based on a measure, we can
also use the Rank function to assign a sequential ranking number. Rank allows you to assign
the specific order number of each result. Just like with TopCount, you’ll want to use a named
set to fully optimize a Rank statement.

For example, suppose we want to rank states by Internet Sales Amount, showing sales for
those states with sales that are non-zero. We need to do two things. First, we create a named
set that sorts the states by Internet Sales Amount, and then we create a calculated mem-
ber that ranks each state by the ordered set. The full query, which we’ll break down later, is
shown here. The query results are shown in Figure 10-22.

WITH SET [SalesRankSet] AS
Filter(
Order(
[Customer].[State-Province].Children ,
[Measures].[Internet Sales Amount], BDESC ),
[Measures].[Internet Sales Amount] <> 0)

MEMBER [Measures].[SalesRank] AS
Rank([Customer].[State-Province].CurrentMember, [SalesRankSet])

SELECT
{ [Measures].[SalesRank],
[Measures].[Internet Sales Amount] } ON COLUMNS,
[SalesRankSet] ON ROWS
FROM [Adventure Works]
WHERE ([Product].[Bikes], [Date].[Fiscal].[FY 2003])

Figure 10-22 Results of using Rank

Now let’s drill into the query just shown. First, because all ranking should ideally be per-
formed against an ordered set, you’ll want to create a predefined (named) set on states that
are sorted by dollar amount in descending order, as shown here:

WITH SET [SalesRankSet] AS
Filter(
Order(
[Customer].[State-Province].Children ,
[Measures].[Internet Sales Amount], BDESC ),
[Measures].[Internet Sales Amount] <> 0)

Next, because the ranking result is nothing more than a calculated column, you can create
a calculated member that uses the Rank function to rank each state against the ordered set,
like this:

MEMBER [Measures].[SalesRank] AS
Rank([Customer].[State-Province].CurrentMember, [SalesRankSet])

This code uses the CurrentMember function. In SQL Server 2005 and 2008, the CurrentMember
function is implied, so you could instead write the member calculation without the
CurrentMember function like this:

MEMBER [Measures].[SalesRank] AS
Rank([Customer].[State-Province], [SalesRankSet])

To extend the current example, it’s likely that you’ll want to rank items across multiple dimen-
sions. Suppose you want to rank sales of products, but within a state or province. You’d
simply include (that is, join) Product SubCategory with State-Province in the query, as shown
here:

WITH SET [SalesRankSet] AS
Filter(
Order(
( [Customer].[State-Province].Children, [Product].[SubCategory].Children ) ,
[Measures].[Internet Sales Amount], BDESC ),
[Measures].[Internet Sales Amount] <> 0)

MEMBER [Measures].[SalesRank] AS
Rank( ( [Customer].[State-Province].CurrentMember,
[Product].[SubCategory].CurrentMember),
[SalesRankSet])

SELECT
{ [Measures].[SalesRank],
[Measures].[Internet Sales Amount] } ON COLUMNS,
[SalesRankSet] ON ROWS
FROM [Adventure Works]
WHERE ( [Date].[Fiscal].[FY 2004])

The query produces the result set shown in Figure 10-23. Note that because we used BDESC
to break the dimension members for state apart, we have California for two product subcate-
gories, and then England and New South Wales, and then California again. So we could have
switched State and SubCategory and essentially produced the same results.

Figure 10-23 Results of using Rank with Filter, Order, and more

Suppose you want to show rankings for each quarter of a year. This presents an interesting
challenge because a product might be ranked first for one quarter and fourth for a differ-
ent quarter. You can use the MDX functions LastPeriods and LastChild to help you retrieve
these values. LastPeriods is one of the many time-aware functions included in the MDX
library. Other such functions include OpeningPeriod, ClosingPeriod, PeriodsToDate, and
ParallelPeriods. LastPeriods takes two arguments (the index or number of periods to go back
and the starting member name), and it returns a set of members prior to and including a
specified member.
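
Because PeriodsToDate was just mentioned, here is a quick year-to-date sketch of ours that uses
it. The calculated member name is made up, and we assume the Fiscal user hierarchy exposes a
level named [Fiscal Year]; adjust the level name if your date dimension differs:

WITH MEMBER [Measures].[Internet Sales FYTD] AS
Sum( PeriodsToDate( [Date].[Fiscal].[Fiscal Year],
[Date].[Fiscal].CurrentMember ),
[Measures].[Internet Sales Amount] )
SELECT { [Measures].[Internet Sales Amount],
[Measures].[Internet Sales FYTD] } ON COLUMNS,
[Date].[Fiscal].[FY 2004].Children ON ROWS
FROM [Adventure Works]

For each child of FY 2004 on the rows, the calculated member sums Internet Sales Amount over
all periods from the start of that fiscal year up to and including the current row member.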

LastChild is just one of the many family functions included in the MDX library. These func-
tions allow you to retrieve one or more members from a dimensional hierarchy based on
position in the hierarchy of the starting member or members and function type. Other func-
tions in this group include Parent, Ancestor, Siblings, and so on. LastChild returns the dimen-
sion member that is the last child (or the last member in the hierarchy level immediately
below the specified member) of the specified member. An example of this query follows, and
the results are shown in Figure 10-24.
WITH SET [OrderedSubCats] AS
Order([Product].[Subcategory].Children, [Measures].[Internet Sales Amount],BDESC)

MEMBER [Measures].[ProductRank] AS
Rank( [Product].[Subcategory].CurrentMember ,
[OrderedSubCats],
[Measures].[Internet Sales Amount])

SET [last4quarters] AS
LastPeriods(4,[Date].[Fiscal Quarter].LastChild)

SELECT { [last4Quarters] *
{[Measures].[Internet Sales Amount], [ProductRank]}} ON COLUMNS,
Order([Product].[Subcategory].Children,
([Measures].[Internet Sales Amount],[Date].[Fiscal Quarter].LastChild),DESC) ON ROWS
FROM [Adventure Works]

Figure 10-24 Results of using Rank with LastPeriods

Head and Tail Functions


Next we’ll look at the MDX Head and Tail functions. Suppose, within a set of top-10 results,
you want to retrieve the final three in the list (that is, results eight through ten, inclusive). You
can use the Tail function, which returns a subset of members from the end of a set, depend-
ing on how many you specify. For example, if you wanted the bottom three results from a
top-10 listing by products and cities, you could write the following query:

WITH SET [Top10Products] AS
Tail(
TopCount( [Product].[Product].Children * [Customer].[City].Children ,
10,
[Measures].[Internet Sales Amount]),
5)

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Top10Products] ON ROWS
FROM [Adventure Works]

The result set is shown in Figure 10-25.



Figure 10-25 Results of using the Tail function

As you might imagine, the Head function works similarly, but allows you to return the first
n number of rows from a set. It is a common BI business requirement to retrieve the top or
bottom members of a set, so we frequently use the Head or Tail function in custom member
queries in OLAP cubes.
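
For example, swapping Head for Tail in the previous query (again, a sketch on our part) returns
the first five product and city combinations from the same top-10 set. Because TopCount already
returns its results in descending order, these are the five biggest sellers:

WITH SET [Top10Products] AS
Head(
TopCount( [Product].[Product].Children * [Customer].[City].Children ,
10,
[Measures].[Internet Sales Amount]),
5)

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Top10Products] ON ROWS
FROM [Adventure Works]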

Hierarchical Functions in MDX


One useful feature of OLAP cubes is their ability to use hierarchies to retrieve result sets by
selecting a specific dimension member (a specific market, product group, or time element
such as a quarter or month) and drilling down (or up) to see all the child (or parent) data.
After you’ve established hierarchies in your OLAP dimensions, you can use several MDX func-
tions to navigate these hierarchies. Let’s take a look at some examples.

For starters, you can use the MDX Children function to retrieve all records for the next level
down in a hierarchy, based on a specific dimension member. For example, you can retrieve
sales for all the subcategories under the category of Bikes with the following query. The
results are shown in Figure 10-26.

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
[Product].[Product Categories].[Category].[Bikes].Children ON ROWS
FROM [Adventure Works]

Figure 10-26 Results of using the Children function

You can take one of the results of the preceding query—for example, Road Bikes—and find
sales for the children of Road Bikes with the following query, the results of which are shown
in Figure 10-27:

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
[Product].[Road Bikes].Children ON ROWS
FROM [Adventure Works]

Figure 10-27 Results of using the Children function at a lower level of the hierarchy

Sometimes you might want to take a specific item and find all the items with the same par-
ent. In other words, you might want to see all of the siblings for a specific value. If you want
to find the sales data for the Road-150 Red, 44, as well as all other products that share the
same subcategory parent as the Road-150 Red, 44, you can use the Siblings function as
shown here:

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
[Product].[Road-150 Red, 44].Siblings ON ROWS
FROM [Adventure Works]

The results are shown in Figure 10-28.

Figure 10-28 Results of using the Siblings function

So we’ve looked at the Children function to drill down into a hierarchy and the Siblings func-
tion to look across a level. Now we’ll look at the Parent function to look up the hierarchy.
Suppose you want to know the sales for the parent member of any member value, such as
the sales for the subcategory value that serves as the parent for the Road-150 Red, 44. Just
use the following Parent function:

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
[Product].[Road-150].Parent ON ROWS
FROM [Adventure Works]

The results are shown in Figure 10-29.

Figure 10-29 Results of using the Parent function

Now that you understand some basic hierarchical functions, we can look at more advanced
functions for reading dimension hierarchies. Although MDX language functions such as
Parent, Children, and Siblings allow you to go up (or down) one level in the hierarchy, some-
times you need more powerful functions to access several levels at once. For instance, sup-
pose you need to retrieve all data from the second level (for example, product brand) down
to the fourth level (for example, product item). You could use a combination of Parent and
Parent.Parent (and even Parent.Parent.Parent) to generate the result. Fortunately, though,
MDX provides other functions to do the job more intuitively.

One such MDX function is Descendants, which allows you to specify a starting point and an
ending point, and option flags for the path to take. For instance, if we want to retrieve sales
data for Hamburg (which is at the State-Province level of the Customer.Customer Geography
hierarchy) and all children down to the postal code level (which would include cities in
between), we can write a query like this:

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
NON EMPTY ( Descendants( [Customer].[Customer Geography].[Hamburg],
[Customer].[Customer Geography].[Postal Code],
SELF_AND_BEFORE )) ON ROWS
FROM [Adventure Works]

The results of this query are shown in Figure 10-30.

Notice the SELF_AND_BEFORE flag in the preceding query. These flags allow you to scope
the results for a specific level or the distance between the two levels specified in the first two
parameters. Essentially, you can decide what specifically between Hamburg and Hamburg’s
postal codes you want to return (or if you want to go even further down the hierarchy).
Here is a list of the different flags you can use in MDX queries, along with a description for
each one:

■■ SELF Provides all the postal code data (lowest level)
■■ AFTER Provides all the data below the postal code level for Hamburg (that is,
Customers)
■■ BEFORE Gives Hamburg as the state, plus all cities for Hamburg, but not postal codes
■■ BEFORE_AND_AFTER Gives us Hamburg the state, plus all cities, plus the data below
the postal code for Hamburg (Customers)
■■ SELF_AND_AFTER Gives postal codes, plus data below the postal code for Hamburg
(Customers)
■■ SELF_AND_BEFORE Gives everything between Hamburg the state and all postal codes
for Hamburg
■■ SELF_BEFORE_AFTER Gives everything from Hamburg the state all the way to the
lowest level (essentially ignores the Postal Code parameter)
■■ LEAVES Same as SELF (all postal codes for Hamburg)

Figure 10-30 Results of using Descendants

At this point, you might be impressed by the versatility of the Descendants function. We’d like
to point out that although this function is very powerful, it does have some limits. In particu-
lar, it does not allow you to navigate upward from a starting point. However, MDX does have
a similar function called Ancestors, that does allow you to specify an existing member and
decide how far up a particular defined hierarchy you would like to navigate. For example, if
you want to retrieve the data for one level up from the State-Province of Hamburg, you can
use the Ancestors function in the following manner:

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,


Ancestors( [Customer].[Customer Geography].[Hamburg],
[Customer].[Customer Geography].[Country])
ON ROWS
FROM [Adventure Works]

The results are shown in Figure 10-31.



Figure 10-31 Results of using Ancestors

Now let’s look at a query that gives us the proverbial “everything but the kitchen sink.” We
want to take the city of Berlin and retrieve both the children sales data for Berlin (Postal Code
and Customers) and the sales data for all parents of Berlin (the country and the All
Customer total). We can achieve this by doing all of the following in the query. The results are
shown in Figure 10-32.

■■ Use Descendants to drill down from Berlin to the individual customer level
■■ Use Ascendants to drill up from Berlin to the top level of the Customer hierarchy
■■ Use Union to combine the two sets into one new result set while filtering out any duplicates.
(Without duplicate removal, the city of Berlin would appear twice in the result set.) You can
use the optional ALL flag with the Union function if you want to preserve duplicates while
combining the sets.
■■ Use Hierarchize to fit all the results back into the regular Customer hierarchy order.
Hierarchize restores the original order of members in a newly created set by default.
You can use the optional POST flag to sort the members of the newly created set into a
“post natural” (or “children before parent members”) order, as shown in the sketch that
follows Figure 10-32.

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,


Hierarchize(
Union(
Ascendants( [Customer Geography].[Berlin] ),
Descendants( [Customer Geography].[Berlin],
[Customer].[Customer Geography].[Customer], SELF_AND_BEFORE )))
ON ROWS
FROM [Adventure Works]

Figure 10-32 Results of combining MDX navigational functions
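
As noted in the Hierarchize bullet earlier, if you prefer a “children before parents” ordering, you
can pass the optional POST flag as the second argument to Hierarchize. The following is a minimal
sketch of that variation (only the flag is added to the preceding query):

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
-- POST sorts each parent after its children in the resulting set
Hierarchize(
Union(
Ascendants( [Customer Geography].[Berlin] ),
Descendants( [Customer Geography].[Berlin],
[Customer].[Customer Geography].[Customer], SELF_AND_BEFORE )), POST)
ON ROWS
FROM [Adventure Works]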



Date Functions
Regardless of what other dimensions are in your data warehouse, it’s almost impossible to
imagine an OLAP cube without some form of date dimensions. Most business users need
to “slice and dice” key measures by month, quarter, year, and so on—and they also need to
perform trend analysis by comparing measures over time. MDX provides several functions for
business users to break out and analyze data by date dimensions. Let’s take a look at some of
the most common date functions.

Before we get started with MDX date functions, let’s look at a basic example of the Children
function against a specific date dimension member expression. For example, if you want to
retrieve all the child data for FY 2004, you can issue the following query. The query results
are shown in Figure 10-33.

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,


[Date].[Fiscal].[FY 2004].Children ON ROWS
FROM [Adventure Works]

Figure 10-33 Results of using Children with date values

If you want to get everything between the Fiscal Year and Fiscal Month levels for 2004, you
can take advantage of the Descendants function as shown in this query:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,


Descendants (
[Date].[Fiscal].[FY 2004],
[Date].[Fiscal].[Month] ,
SELF_AND_BEFORE ) ON ROWS
FROM [Adventure Works]

The query results are shown in Figure 10-34.



Figure 10-34 Results of using Descendants with date values

Next, let’s say you want to look at sales for Q4 2004 and sales for “the same time period a
year ago.” You can use the ParallelPeriod function, which allows you to retrieve members
from another period (at any defined level) in the hierarchy, in this case one year (that is, four
quarters) prior, as shown in the following query. The results are shown
in Figure 10-35.

WITH MEMBER [SalesFromLYQuarter] AS


( [Measures].[Internet Sales Amount],
ParallelPeriod( [Date].[Fiscal].[Fiscal Quarter], 4) )

SELECT { [Measures].[Internet Sales Amount],


[SalesFromLYQuarter] } ON COLUMNS,
[Product].[Bikes].Children ON ROWS
FROM [Adventure Works]
WHERE [Date].[Fiscal].[Q4 FY 2004]

Figure 10-35 Results of using the ParallelPeriod function

As mentioned previously, OpeningPeriod is one of the many time-based functions included in
the MDX library. It is a common business requirement to get a baseline value from an opening
time period. (There is also a corresponding ClosingPeriod MDX function.) OpeningPeriod
takes two arguments: the level and the member from which you want to retrieve the values.
In the following query, we use OpeningPeriod to retrieve values from the first
month and the first quarter of a particular period:
WITH MEMBER [First Month] AS
([Measures].[Internet Sales Amount],
OpeningPeriod ( [Date].[Fiscal].[Month],
[Date].[Fiscal])) , FORMAT_STRING = 'CURRENCY'

MEMBER [First Quarter] AS


([Measures].[Internet Sales Amount],
OpeningPeriod ( [Date].[Fiscal].[Fiscal Quarter],
[Date].[Fiscal])) , FORMAT_STRING = 'CURRENCY'

SELECT {[First Month],


[First Quarter],
[Measures].[Internet Sales Amount]} ON COLUMNS,
NON EMPTY [Product].[SubCategory].Children ON ROWS
FROM [Adventure Works]
WHERE [Date].[Fiscal].[FY 2004]

The query results are shown in Figure 10-36.

Figure 10-36 Results of using OpeningPeriod
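
Because ClosingPeriod takes the same two arguments as OpeningPeriod, you could retrieve the
final month and final quarter of the fiscal year simply by swapping the function name. The
following is a hedged sketch of that variation (the member names Final Month and Final Quarter
are ours):

WITH MEMBER [Final Month] AS
([Measures].[Internet Sales Amount],
-- ClosingPeriod returns the last sibling at the given level within the given member
ClosingPeriod ( [Date].[Fiscal].[Month],
[Date].[Fiscal])) , FORMAT_STRING = 'CURRENCY'

MEMBER [Final Quarter] AS
([Measures].[Internet Sales Amount],
ClosingPeriod ( [Date].[Fiscal].[Fiscal Quarter],
[Date].[Fiscal])) , FORMAT_STRING = 'CURRENCY'

SELECT {[Final Month],
[Final Quarter],
[Measures].[Internet Sales Amount]} ON COLUMNS,
NON EMPTY [Product].[SubCategory].Children ON ROWS
FROM [Adventure Works]
WHERE [Date].[Fiscal].[FY 2004]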

You might simply want to show sales for a certain date period (month, quarter, and so on)
and also show sales for the prior period. You can use the PrevMember function in
a calculated member to do this, as in the following query. The results are shown in Figure 10-37.

WITH MEMBER [SalesPriorDate] AS


([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember),
FORMAT_STRING = 'CURRENCY'

SELECT {[Measures].[Internet Sales Amount], [SalesPriorDate]} ON COLUMNS,


Order( [Customer].[State-Province].Children,
[SalesPriorDate],BDESC)
HAVING [SalesPriorDate] > 300000
ON ROWS
FROM [Adventure Works]
WHERE [Date].[Fiscal].[FY 2004] -- will also show 2003

Figure 10-37 Results of using PrevMember to display specific values

Finally, you can use LastPeriods and LastChild to retrieve the last four quarters of available
data. Note the use of the Order function to sort the fiscal quarters by quarter name in
descending order by using the BDESC keyword, as shown in the query. The BDESC keyword is
one of the optional keywords that you can add to affect the sort results produced by the Order
function. The options are ascending (ASC), descending (DESC), ascending breaking the natural
hierarchy (BASC), or descending breaking the natural hierarchy (BDESC). By “breaking the
natural hierarchy,” we mean that the results are sorted purely by the specified value, without
maintaining the hierarchical grouping of members under their parents. The default for the
Order function is ASC. In our example, BDESC causes the results to be sorted by the key of
the fiscal quarter member rather than by the sales amount. The results
are shown in Figure 10-38.

WITH SET [Last4Quarters] AS


Order( LastPeriods(4, [Date].[Fiscal Quarter].LastChild),
[Date].[Fiscal Quarter].CurrentMember.Properties ('Key'),BDESC)

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,


[Last4quarters] ON ROWS
FROM [Adventure Works]

Figure 10-38 Results of using LastChild with date values

Using Aggregation with Date Functions


The Sum function is not new to you, but we’ll start with it as a basis for showing examples
of other statistical functions included in the MDX library. We also use one of the time-based
functions in the query—PeriodsToDate. This function creates a set that is passed to the Sum
function. In addition to PeriodsToDate, MDX also includes the shortcut functions Wtd, Mtd,
Qtd, and Ytd. These are simply variants of the PeriodsToDate function that are created to work
with time data from a specific level—such as weeks, months, quarters, or years.

This query shows a simple Sum aggregate in the calculated member section:

WITH MEMBER [SalesYTD] AS


Sum(
PeriodsToDate ([Date].[Fiscal].[Fiscal Year],
[Date].[Fiscal].CurrentMember) ,
[Measures].[Internet Sales Amount])

SELECT { [Measures].[Internet Sales Amount],


[Measures].[SalesYTD] } ON COLUMNS,
[Product].[Category].Children ON ROWS
FROM [Adventure Works]
WHERE [Date].[Q3 FY 2004]

The results are shown in Figure 10-39.

Figure 10-39 Results of using Sum with date values

Here’s an interesting challenge. Suppose you want to list sales for each month and also show
the 12-month moving average of sales (in other words, for each month, the average monthly
sales for the prior 12 months). You can aggregate using the Avg function, and then use the
LastPeriods function to go back 12 months as shown in the following query. The results are
shown in Figure 10-40.

WITH MEMBER [12MonthAvg] AS


Avg(LastPeriods(12,[Date].[Calendar].PrevMember),
[Measures].[Internet Sales Amount])

SELECT {[Measures].[Internet Sales Amount], [12MonthAvg]} ON COLUMNS,


[Date].[Calendar].[Month] ON ROWS
FROM [Adventure Works]
WHERE [Date].[FY 2003]

Figure 10-40 Results of using the Avg function with date values

Of course, there are statistical functions other than Sum and Avg available in the MDX
library. Some of the other statistical functions that we commonly use are Count, Max,
Median, Rank, Var, and the functions that relate to standard deviation, Stdev and StdevP.
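
For example, a calculated member that uses Max to report the largest single month of Internet
sales within the current date selection might look like the following sketch (the MaxMonthlySales
name is ours, and the query is illustrative rather than tuned):

WITH MEMBER [MaxMonthlySales] AS
-- Max evaluates the measure over the set of month-level descendants of the current Fiscal member
Max( Descendants( [Date].[Fiscal].CurrentMember, [Date].[Fiscal].[Month]),
[Measures].[Internet Sales Amount])

SELECT { [Measures].[Internet Sales Amount],
[MaxMonthlySales] } ON COLUMNS,
[Product].[Category].Children ON ROWS
FROM [Adventure Works]
WHERE [Date].[Fiscal].[FY 2004]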

About Query Optimization


There are many different factors involved in query processing in SSAS. These include physical
configuration of OLAP data (such as partitions), server hardware and configuration (memory,
CPU, and so on), and internal execution processes. We’ll take a closer look at a key aspect of
these internal execution processes next.

As with Transact-SQL query execution in the relational SQL Server engine, SSAS
makes use of internal mechanisms to determine the optimal MDX query execution
plan. These mechanisms include multiple internal caches to efficiently produce results. These
caches are scoped quite differently, so you’ll want to understand these scopes when evaluat-
ing and tuning MDX queries. In general, when you execute MDX queries, if data is in a cache,
SSAS will retrieve it from that cache rather than from disk. On-disk data includes both calcu-
lated aggregations and fact data.

Cache scopes include query context, session context, and global context. We include this
information because each query can make use of the cache of only a single context type in
a particular execution. A concept that you’ll work with when evaluating MDX query perfor-
mance is that of a subcube. Subcubes are subsets of cube data that are defined by MDX que-
ries. It is important for you to understand that each MDX query is broken into one or more
subcubes by the SSAS query optimizer. When you are evaluating query performance using
diagnostic tools such as SQL Server Profiler, you’ll examine these generated subcubes to
understand the impact of the MDX query execution. The most efficient scope to access is the
global scope because it has the broadest reuse possibilities.

Of course, other factors (such as physical partitioning and aggregation design) affect query
performance. Those topics were covered in Chapter 9, “Processing Cubes and Dimensions,”
and this chapter has focused on writing efficient MDX query syntax. We can’t cover every
scenario in this chapter, so you might also want to review the SQL Server Books Online topic
“Performance Improvements for MDX in SQL Server 2008 Analysis Services” at
http://msdn.microsoft.com/en-us/library/bb934106.aspx.

When you are evaluating the effectiveness of your MDX statement, there are advanced
capture settings available in SQL Server Profiler—such as Query Processing/Query
Subcube Verbose—that you can use to evaluate which cache (if any) was used when
executing your query. For more information, see the white paper titled “SQL Server
Analysis Services Performance Guide” at http://www.microsoft.com/downloads/
details.aspx?FamilyID=3be0488d-e7aa-4078-a050-ae39912d2e43&DisplayLang=en.

Summary
In this chapter, we reviewed MDX syntax and provided you with many examples. These code
samples included many of the functions and keywords available in the MDX language. We
kept the code examples as simple as possible so that we could demonstrate the functional-
ity of the various MDX statements to you. In the real world, of course, business requirements
often dictate that you’ll work with queries of much greater complexity. To that end, in the
next chapter we’ll take a more in-depth look at how you can use MDX to solve common ana-
lytical problems in a data warehouse environment.
Chapter 11
Advanced MDX
Now that you’ve seen MDX examples in the previous chapter, we turn our attention to more
advanced uses of MDX, including using MDX in real-world applications. In this chapter, we
take a look at a number of examples related to advanced MDX query writing. These include
querying dimension properties, creating calculated members, using the IIf function, working
with named sets, and gaining an understanding of scripts and SOLVE_ORDER. We also look
at creating KPIs programmatically. We close the chapter with an introduction to working with
MDX in SQL Server Reporting Services (SSRS) and PerformancePoint Server.

Querying Dimension Properties


We spent considerable time in the last chapter talking about hierarchies and drilling down
from countries and product groups to see sales data summarized at various (dimension) lev-
els, such as state, product subcategory, and so on. Although that’s obviously an important
activity in an OLAP environment, there are other ways to “slice and dice” data.

For instance, in the Adventure Works DW 2008 Customer dimension, there are many
demographic attributes you can use to analyze sales data. In Figure 11-1, you see dimension
members such as Education, Marital Status, and Number Of Cars Owned.

Figure 11-1 Various dimension hierarchy attributes


In previous versions of SQL Server Analysis Services (SSAS), these infrequently used attributes
were often implemented as member properties rather than as dimension attributes. Starting
with the redesigned SSAS implementation in 2005, most developers choose to include these
values as dimension attributes because query performance is improved. Usually, there is no
aggregate hierarchy defined on these attributes—that is, they are presented as single-line
attributes in the cube with no rollup navigational hierarchies defined. It is a common business
requirement—for example, for a marketing team—to be able to analyze past data based on
such attributes so that they can more effectively target new sales campaigns.

You can write MDX queries that include these dimensional attributes. For example, if you
wanted to see sales revenue for bike sales in France, broken down by the number of cars
owned by the customer, you can write the following MDX query:

SELECT [Internet Sales Amount] ON COLUMNS,
[Customer].[Number of Cars Owned].[Number of Cars Owned].Members ON ROWS
FROM [Adventure Works]
WHERE ([Product].[Bikes] , [Customer].[France])

Suppose you want to produce a query that generates sales for the last 12 months of available
data and lists the sales in reverse chronological sequence (most recent month at the top). You
might be tempted to write the following MDX query:

WITH SET [Last12Months] AS
Order(LastPeriods(12,
Tail([Date].[Fiscal].[Month].Members,1).Item(0).Item(0)),
[Date].[Fiscal],BDESC)

SELECT [Internet Sales Amount] ON COLUMNS,
[Last12Months] ON ROWS
FROM [Adventure Works]

Does this generate the desired output? The numbers don’t lie, and they show that the desired
outcome was not achieved—take a look at Figure 11-2:

Figure 11-2 Incorrect results when trying to order by descending dates

There are two issues with the result set. First, you didn’t achieve the desired sort (of descend-
ing months). Second, you have the empty months of August 2004 and November 2006. The
second problem occurs because the MDX query used the MDX Tail function to retrieve the
last member in the Date.Fiscal.Month hierarchy; however, there isn’t any actual data posted
for those two months. Now that you see the result set, you can see that what you really want
is the last 12 months of available data, where the first of the 12 months is the most recent
month where data has been posted (as opposed to simply what’s in the month attribute
hierarchy).

Let’s tackle the problems in reverse order. First, let’s change the query to determine the last
month of available data by filtering on months against the Internet Sales Amount and using
the Tail function to retrieve the last member from the list:

WITH SET [LastMonth] AS
Tail( Filter([Date].[Calendar].[Month],[Internet Sales Amount]),1)

SET [Last12Months] AS
Order(LastPeriods(12,[LastMonth].Item(0).Item(0)),
[Date].[Fiscal],BDESC)

SELECT [Internet Sales Amount] ON COLUMNS,
[Last12Months] ON ROWS
FROM [Adventure Works]

That takes care of the second issue (the most recent month in the result set is now July 2004
instead of November 2006), but you still have the issue of the sort. So why doesn’t the code
in the first two listings, which sorts on the [Date].[Fiscal] level, work correctly?

Generally speaking, sorting on dimension attributes is different from sorting on measures.


Each dimension attribute has a KeyColumns collection that you can use for ordering. In the
case of the Month attribute, the KeyColumns collection contains two definitions (the year and
month stored as integers), as shown in Figure 11-3.

So any MDX query that sorts on the actual month values must reference both KeyColumns
collection properties. You can reference the key properties with Properties (“Key0”) and
Properties (“Key1”), as shown in the following code listing. Note that because the second key
is an integer key representing the month, you need to right-justify and zero-fill it, using a
Microsoft Visual Basic for Applications function. This is so that the year and month combined
will be represented consistently (that is, 200809 for September 2008, 200810 for October
2008) for sorting purposes.

WITH SET [LastMonth] AS
Tail( Filter([Date].[Calendar].[Month],[Internet Sales Amount]),1)

SET [Last12Months] AS
Order(LastPeriods(12,[LastMonth].Item(0).Item(0)),
[Date].[Fiscal].CurrentMember.Properties("Key0") + VBA!Right("0" +
[Date].[Fiscal].CurrentMember.Properties("Key1") ,2),BDESC)

SELECT [Internet Sales Amount] ON COLUMNS,
[Last12Months] ON ROWS
FROM [Adventure Works]

Figure 11-3 Reviewing dimension attribute properties in Business Intelligence Development Studio (BIDS)

The preceding code generates the results in Figure 11-4—mission accomplished! The great
thing about this query is that you can use this approach any time you need to generate a
report for a user that shows the last 12 months of available data.

Figure 11-4 The correct results when ordering by descending dates

Looking at Date Dimensions and MDX Seasonality


Many organizations have business requirements related to retrieving data that has seasonal-
ity, and they evaluate sales performance based on different time frames across years. One
common scenario is a user wanting to know how sales are faring when compared to the fis-
cal year’s goals. Figure 11-5 shows the Calculations tab in BIDS with an MDX expression that
retrieves the sum of the periods to date in the year for Internet Sales Amount.

Figure 11-5 Performing a calculation of all PeriodsToDate in BIDS

As mentioned in Chapter 10, “Introduction to MDX,” MDX includes several shortcut functions
for common time-based queries. These functions are variants of the PeriodsToDate function.
Let’s take a closer look at how one of these functions works. For our example, we’ll use the
Ytd function. This function returns a set of sibling members from the same level as a given
member, starting with the first sibling and ending with the given member, as constrained by
the Year level in the Time dimension. The syntax is Ytd([«Member»]).

This function is a shortcut function to the PeriodsToDate function that defines that func-
tion’s «Level» argument to be Year. If no particular member is specified, a default value of
Time.CurrentMember is used. Ytd(«Member») is equivalent to PeriodsToDate(Year, «Member»).
Other examples of time-aware MDX functions are the week-to-date, month-to-date, and
quarter-to-date (Wtd, Mtd, and Qtd) functions.
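
For instance, a year-to-date calculation similar to the SalesYTD member from Chapter 10 could be
written with the Ytd shortcut rather than PeriodsToDate. The following is a hedged sketch against
the calendar hierarchy (Ytd relies on the year level being typed as Years, which is why we use the
Calendar rather than the Fiscal hierarchy here, and the [Date].[CY 2004] member name assumes
the Adventure Works calendar year naming):

WITH MEMBER [SalesYTD] AS
-- Ytd builds the set from the first month of the year through the current month
Sum( Ytd( [Date].[Calendar].CurrentMember),
[Measures].[Internet Sales Amount])

SELECT { [Measures].[Internet Sales Amount],
[Measures].[SalesYTD] } ON COLUMNS,
[Date].[Calendar].[Month] ON ROWS
FROM [Adventure Works]
WHERE [Date].[CY 2004]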

Creating Permanent Calculated Members


Up until now, you’ve been placing calculations and named sets in-line, as part of the MDX
query. In an actual production environment, if you want to reuse these definitions, it’s a
good idea to store these calculated members and named sets inside the OLAP cube. As we
discussed in Chapter 8, “Refining Cubes and Dimensions,” you create permanent calculated
members using the Calculations tab in BIDS.

For example, you might want to create a calculation for sales for the previous fiscal period
(which could be the previous month, previous quarter, or even previous year from the current
date selection) using the following MDX code.

WITH MEMBER [SalesPriorFiscalPeriod] AS
([Measures].[Internet Sales Amount],
[Date].[Fiscal].PrevMember)

SELECT {[SalesPriorFiscalPeriod],
[Internet Sales Amount]} ON COLUMNS,
NON EMPTY [Product].[Category].Members ON ROWS
FROM [Adventure Works]
WHERE [Date].[March 2004]

To store this calculated member (SalesPriorFiscalPeriod) permanently in an OLAP cube, you


can use the BIDS interface to create a calculated member. If you do not want the member to
be permanent but do want to use it over multiple queries, you can create a calculated mem-
ber that will persist for the duration of the session. The syntax to do this is CREATE MEMBER
rather than WITH MEMBER.
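
For example, running a statement such as the following from SSMS (while connected to the SSAS
database) would create a session-scoped version of the calculated member. This is a hedged
sketch rather than the exact script that BIDS generates:

-- Session-scoped calculated member; it disappears when the session ends
CREATE MEMBER [Adventure Works].[Measures].[SalesPriorFiscalPeriod] AS
([Measures].[Internet Sales Amount],
[Date].[Fiscal].PrevMember),
FORMAT_STRING = 'Currency',
VISIBLE = 1;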

Creating Permanent Calculated Members in BIDS


As we’ve shown previously, when you create a new calculated member using the BIDS cube
designer Calculations tab, you first enter the name for the new calculated member and then
the associated MDX expression. (See Figure 11-6.)

Figure 11-6 BIDS interface to create a new calculated member

After you save the changes, you can reference the new calculated member in any MDX query
in the same way you reference any other member that is defined in the cube. (See the fol-
lowing code sample and its results, which are shown in Figure 11-7.) This query returns the
prior month’s (February 2004) Internet Sales Amount in the SalesPriorFiscalPeriod member.
Note that in the query, you could have specified a quarter or a year instead of a month, and
the calculated member would give you the prior corresponding period. For example, if you
specified FY 2004 as the date, the SalesPriorFiscalPeriod member would return the Internet
Sales Amount for 2003.

This calculated member is computed when an MDX query containing this member is exe-
cuted. As we mentioned in Chapter 8, calculated member values are not stored on disk;
rather, results are calculated on the fly at query execution time. Of course, internal caching
can reuse calculated member results very efficiently. You’ll recall from our discussion at the
end of Chapter 10 that SSAS uses three internal cache contexts (query, session, and global) to
store results. You can investigate use of query caches using SQL Server Profiler traces as well.

SELECT {[SalesPriorFiscalPeriod],
[Internet Sales Amount]} ON COLUMNS,
NON EMPTY [Product].[Category].Members ON ROWS
FROM [Adventure Works]
WHERE [Date].[March 2004]
--WHERE [Date].[Q2 FY 2004]
--WHERE [Date].[FY 2004]

Figure 11-7 The results of querying a permanent calculated member

Creating Calculated Members Using MDX Scripts


Now that you’ve created a permanent saved script that will create a calculated member using
the BIDS interface, let’s create a second calculated member using a CREATE MEMBER state-
ment. (In fact, when you use the BIDS interface to create a calculated member, BIDS actually
generates scripts behind the scenes. You can toggle between the form view in BIDS and the
script listing by using the Form View and Script View icons on the Calculations toolbar.)

This second calculated member will determine the percentage of sales growth from one
period to the prior period (and will actually use the SalesPriorFiscalPeriod member from
the previous section). You can execute this calculated member code from SQL Server
Management Studio (SSMS) after connecting to the corresponding SSAS instance and sample
database.

CREATE MEMBER [Adventure Works].[Measures].[FiscalSalesPctGrowth] AS
( [Measures].[Internet Sales Amount] - [SalesPriorFiscalPeriod]) /
[SalesPriorFiscalPeriod] ,
FORMAT_STRING = "Percent",
VISIBLE = 1;

Finally, you can write an MDX query to use both calculations, which will produce a result set
arranged by product category for March 2004, showing the dollar sales for the month, the
dollar sales for the previous month, and the percent of change from one month to the next.
(See Figure 11-8.) Once again, note that you could have selected other date dimension mem-
bers in the WHERE clause and the calculated members would have behaved accordingly.

SELECT {[SalesPriorFiscalPeriod],
[FiscalSalesPctGrowth],
[Internet Sales Amount]} ON COLUMNS,
NON EMPTY [Product].[Category].Members ON ROWS
FROM [Adventure Works]
WHERE [Date].[March 2004]
--WHERE [Date].[Q2 FY 2004]
--WHERE [Date].[FY 2004]

Figure 11-8 Results of calculated member created through scripting

Keep in mind that calculated members do not aggregate, so they do not increase storage
space needs for the cube. Also, for this reason, you do not need to reprocess the associated
OLAP cube when you add a calculated member to it. Although it’s easy to add calculated
members, you must carefully consider the usage under production load. Because member
values are calculated upon querying (and are not stored on disk), the query performance for
these values is slower than when you are accessing stored members.

New to SQL Server 2008 is the ability to dynamically update a calculated member using the
UPDATE MEMBER syntax. The only member types that you can update using this syntax are
those that are defined in the same session (scope). In other words, UPDATE MEMBER cannot
be used on the BIDS cube designer’s Calculations (MDX script) tab; rather, it can be used only
in queries from SSMS or in custom code solutions.

Tip Also new to SQL Server 2008, the CREATE MEMBER statement allows you to specify a display
folder (property DISPLAY_FOLDER) and an associated measure group (property ASSOCIATED_
MEASURE_GROUP). Using these new properties can make your calculated members more discov-
erable for end users.

Using IIf
You might encounter a common problem when you use the calculated members that you
created in the last section. Let’s take a look at the results if you were to run the query for the
first month (or quarter, year, and so on) of available data. If you’re wondering what the prob-
lem is, think about how MDX would calculate a previous member (using the PrevMember
function) for month, quarter, year, and so on when the base period is the first available
period. If no previous member exists, you get nulls for the SalesPriorFiscalPeriod member,
and division by null for the FiscalSalesPctGrowth member. (See Figure 11-9.)

Figure 11-9 The results when querying for the first period, when using a PrevMember statement

Fortunately, MDX provides an immediate if (IIf) function so that you can test for the
presence of a previous member before actually using it. So the actual calculation for
SalesPriorFiscalPeriod is as follows:

IIf( ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember),
([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember),
'N/A')

So, in this example, you get an N/A in the SalesPriorFiscalPeriod member any time data does
not exist for the previous member in the Date dimension. You can perform a similar IIf check
for the FiscalSalesPctGrowth member, using a similar code pattern as shown in the preceding
example, and then generate a better result set. (See Figure 11-10.)

Figure 11-10 The results when implementing an IIf function to check for the existence of a previous member
of the Date dimension

There are alternatives to using an IIf expression. One such alternative is to rewrite the query
using the MDX CASE keyword. For more information and syntax examples of using CASE, go
to http://msdn.microsoft.com/en-us/library/ms144841.aspx. Also note that conditional logic
captured in an MDX query using CASE rather than IIf (particularly when the conditions are nested)
often executes more efficiently.
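
For example, the SalesPriorFiscalPeriod expression shown earlier could be rewritten with CASE
along the following lines. This is a sketch of the pattern only; as noted, whether it outperforms
IIf depends on the particular query:

CASE
-- Guard against the first period, which has no previous member data
WHEN IsEmpty( ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember) )
THEN 'N/A'
ELSE ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember)
END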

Another way to circumvent the potential performance issues associated with the IIf function
is to create the expressions as calculated members that use the SCOPE keyword to limit the scope
of the member definition to a subcube. This is an advanced technique, and you should use it
only when your IIf expression does not perform adequately under a production load. For more
detail, see the following blog entry: http://blogs.msdn.com/azazr/
archive/2008/05/01/ways-of-improving-mdx-performance-and-improvements-with-mdx-in-
katmai-sql-2008.aspx.

Note In SQL Server 2008, Microsoft has improved the performance of several commonly used
MDX functions, such as IIf and others. However, to realize these performance improvements,
you must avoid several conditions. One example is the usage of defined cell security in an MDX
expression with an optimized function. For more detail, see the SQL Server Books Online topic
“Performance Improvements in MDX for SQL Server 2008 Analysis Services” at
http://msdn.microsoft.com/en-us/library/bb934106.aspx.

About Named Sets


We introduced named sets back in Chapter 10. A named set is a collection of tuples (often
an ordered collection) from one or more dimensions.
build a list of the top 10 products by profit, use a Rank function to generate a ranking num-
ber for each product, and then display the Internet Gross Profit and rank number for each
product, for sales in Canada (as shown in the next code sample and Figure 11-11).

WITH SET [Top10ProductsByProfit] AS
TopCount( [Product].[Product Categories].[Product].Members,
10,[Measures].[Internet Gross Profit])

MEMBER [ProductProfitRank] AS
Rank([Product].[Product Categories].CurrentMember,
[Top10ProductsByProfit])

SELECT {[Measures].[Internet Gross Profit],
[Measures].[ProductProfitRank]} ON COLUMNS,
[Top10ProductsByProfit] ON ROWS
FROM [Adventure Works]
WHERE [Customer].[Country].[Canada]

Figure 11-11 Basic top-10 list generated when using TopCount and Rank in-line

So, now that you’ve defined a useful named set in MDX, let’s store the script that creates it
permanently inside the OLAP cube so that you can reuse it. Using steps similar to those in the
previous section, you can use BIDS to permanently create the MDX script that will create the
named set and calculated member. Alternatively you can execute an MDX script in SSMS to
create a named set that will persist for the duration of the user session. After you create the
defined named sets, you can test that script by writing a small MDX query (shown in the fol-
lowing code sample) that uses the (now persistent) calculated member ProductProfitRank and
the (now persistent) named set Top10ProductsByProfit:

SELECT {[Measures].[Internet Gross Profit],
[Measures].[ProductProfitRank]} ON COLUMNS,
[Top10ProductsByProfit] ON ROWS
FROM [Adventure Works]

This code generates the result set shown in Figure 11-12.

Figure 11-12 Correct results when using a persistent named set, with no subsequent dimension slicing
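
As an aside, if you want the named set and the ranking member only for the duration of a session,
you could run a script along the following lines from SSMS rather than deploying them in BIDS.
This is a hedged sketch; note that a set created this way is static unless you add the DYNAMIC
keyword discussed next:

-- Session-scoped named set and calculated member
CREATE SET [Adventure Works].[Top10ProductsByProfit] AS
TopCount( [Product].[Product Categories].[Product].Members,
10,[Measures].[Internet Gross Profit]);

CREATE MEMBER [Adventure Works].[Measures].[ProductProfitRank] AS
Rank([Product].[Product Categories].CurrentMember,
[Top10ProductsByProfit]);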

The results in Figure 11-12 seem fine. Of course, you didn’t do any dimension slicing, so the
TopCount and Rank functions are running against the entire OLAP database (by Product).
Let’s run our persistent named set against sales for Canada (as shown in the following code
sample), which produces the result set shown in Figure 11-13.

SELECT {[Measures].[Internet Gross Profit],
[Measures].[ProductProfitRank]} ON COLUMNS,
[Top10ProductsByProfit] ON ROWS
FROM [Adventure Works]
WHERE [Customer].[Country].[Canada]

So, what’s wrong with this query? Although the numbers are certainly smaller, the profit
ranking and order of items are not correct. Here is the reason why: In SQL Server 2005,
persistent named sets were static in nature. Unlike persistent calculated members, which
are always dynamically evaluated when dimension slicing occurs, persistent named sets
were evaluated only once, when the set was created. So the TopCount ordered set (and
the Rank function that used the named set) worked off the original order, before any
dimension slicing occurred. This was a significant drawback of persistent named sets in SQL
Server 2005.

Figure 11-13 Incorrect results when using a persistent named set, with subsequent dimension slicing

Fortunately, SQL Server 2008 introduces a new feature called dynamic named sets, which
solves this issue. Named sets marked with the new keyword DYNAMIC are evaluated for each
query that runs. In BIDS, you can choose to create named sets as either dynamic or static. (See
Figure 11-14.)

Figure 11-14 Creating a dynamic named set in SQL 2008 to honor subsequent dimension slicing

The BIDS designer actually generates the following code for the named set:

CREATE DYNAMIC SET CURRENTCUBE.[Top10ProductsByProfit] AS
TopCount( [Product].[Product Categories].[Product].Members,
10,[Measures].[Internet Gross Profit]) ;

After changing the named set and redeploying the cube, you can re-execute the query
shown in the code sample that precedes Figure 11-13 and get the correct results (as shown in
Figure 11-15).

Tip MDX expert Mosha Pasumansky has written a useful blog entry on MDX dynamic named
sets at http://sqljunkies.com/WebLog/mosha/archive/2007/08/24/dynamic_named_sets.aspx.

Figure 11-15 Correct results when using a persistent dynamic named set with dimension slicing

About Scripts
You’ll remember from previous chapters that in addition to using calculated members, you
can also add MDX scripts to your SSAS cube via the Calculations tab in BIDS. Here you can
use the guided interface to add calculated members and more, or you can simply type the
MDX into the script code window in BIDS. The scripts we examine in this section each use at
least one instance of the MDX SCOPE keyword. Using SCOPE in an MDX script allows you to
control the portion of the cube to which subsequent MDX statements are applied; a SCOPE
statement defines a subset of your cube (sometimes called a subcube). Unlike a named set,
this subcube is usually created so that you can not only read it but also make changes to it
or write to it.

Note You can also use many MDX keywords—such as CALCULATE, CASE, FREEZE, IF, and oth-
ers—in an MDX script. For more information, see the SQL Server Books Online topics “The Basic
MDX Script” and “MDX Scripting Statements.”

A common business scenario for using subcubes is the one shown in the following example—
that is, budget allocations based on past history and other factors. Subcubes are convenient
for these kinds of scenarios because it is typical for business budgeting to be based on a
number of factors—some past known values (such as actual sales of a category of products
over a period of time for a group of stores by geography) combined with some future pre-
dicted values (such as newly introduced product types, for which there is no sales history).
These factors often need to be applied to some named subset (or subcube) of your enter-
prise data.

There are two parts to a SCOPE command. The SCOPE section defines the subcube that the
subsequent statements will apply to. The This section applies whatever change you want
to make to the subcube. We’ll also look at the FREEZE statement as it is sometimes used in
scripts with the SCOPE command.

The sample script that is part of the Adventure Works cube, called Sales Quota Allocation, is a
good example of using script commands. Switch to the script view on the Calculations tab in
BIDS and you’ll see two complete scripts (using both the SCOPE statement and This function)
as shown in the following code sample:

/*--------------------------------------------------------------
| Sales Quota Allocation |
--------------------------------------------------------------*/

/*-- Allocate equally to quarters in H2 FY 2005 --------------*/

SCOPE
(
[Date].[Fiscal Year].&[2005],
[Date].[Fiscal].[Fiscal Quarter].Members,
[Measures].[Sales Amount Quota]
) ;

This = ParallelPeriod
(
[Date].[Fiscal].[Fiscal Year], 1,
[Date].[Fiscal].CurrentMember
) * 1.35 ;

End Scope ;

/*--- Allocate equally to months in FY 2002 --------------------*/

SCOPE
(
[Date].[Fiscal Year].&[2002],
[Date].[Fiscal].[Month].Members
) ;

This = [Date].[Fiscal].CurrentMember.Parent / 3 ;

End Scope ;

Here is a bit more detail on the This function and FREEZE statement, which are often used in
conjunction with the SCOPE keyword:

■■ This This function allows you to set the value of cells as defined in a subcube (usually
by using the MDX keyword SCOPE to define the subcube). This is illustrated in a script in
the preceding code sample.
■■ FREEZE This statement (not a function) locks the cell values of the specified subcube
to their current values. It’s used in MDX scripts to pin a subcube (that is, exempt
it from being updated) during the execution of subsequent MDX statements that use the
SCOPE statement and the This function. An example is shown in the following code
sample:

FREEZE
( [Date].[Fiscal].[Fiscal Quarter].Members,
[Measures].[Sales Amount Quota]
);

An important consideration when using the new Calculations tab in BIDS to design MDX
script objects is the order in which you add the script objects. Scripts are executed in the
order (top to bottom) listed in the Script Organizer window. You can change the order of
execution by right-clicking any one script and then clicking Move Up or Move Down. You can
also change the order of execution for calculated members (or cells) by using the MDX key-
word SOLVE_ORDER (explained in the next section of this chapter) inside the affected scripts.

Understanding SOLVE_ORDER
Suppose you want to produce a result set that shows sales amount, freight, and freight per
unit as columns, and for these columns, you want to show Q3 2004, Q4 2004, and the dif-
ference between the two quarters as rows. Based on what you’ve done up to this point, you
might write the query as follows (which would produce the result set shown in Figure 11-16):

WITH MEMBER [Measures].[FreightPerUnit] AS
[Measures].[Internet Freight Cost] /
[Measures].[Internet Order Quantity]
, FORMAT_STRING = '$0.00'

MEMBER [Date].[Fiscal].[Q3 to Q4Growth] AS
[Date].[Fiscal].[Fiscal Quarter].[Q4 FY 2004] -
[Date].[Fiscal].[Fiscal Quarter].[Q3 FY 2004]

SELECT
{[Internet Sales Amount],[Internet Freight Cost],
[FreightPerUnit] } ON COLUMNS,
{[Date].[Fiscal].[Fiscal Quarter].[Q3 FY 2004],
[Date].[Fiscal].[Fiscal Quarter].[Q4 FY 2004],
[Date].[Fiscal].[Q3 to Q4Growth] } ON ROWS
FROM [Adventure Works]

Figure 11-16 First result set, for an all customer total

Do the results for the query look correct? Specifically, take a look at the FreightPerUnit calcu-
lation for the third row (that shows the difference between the two quarters). The cell should
contain a value of 72 cents ($8.42 minus $7.70). The cell, however, contains $12.87. Although
that value represents “something” (the growth in freight cost divided by the growth in order
quantity), the bottom row should contain only values that represent the change in each col-
umn. So, for the FreightPerUnit column, it should be the freight per unit for Q4 minus the
freight per unit for Q3.

So why isn’t the correct calculation being generated? Before you answer that question, stop
and think about the query. The requirements for this query ask you to do something you
previously haven’t done—perform calculations on both the row and column axes. Prior to
this, you’ve generally created only new calculated members in one dimension—namely, the
measures dimension.

In this case, however, calculated measures are created in non-measure dimensions, so you
must consider the order of execution of these measures. Specifically, you need to tell MDX
that you want to calculate the FreightPerUnit member first, and then the Growth member
second. Stated another way, you need to set the calculation order, or solve order. MDX con-
tains a keyword, SOLVE_ORDER, that allows you to set the solve order for each calculated
member. So you can add SOLVE_ORDER settings to the two calculated members, as shown in the
following code sample, with the results shown in Figure 11-17.

WITH MEMBER [Measures].[FreightPerUnit] AS
[Measures].[Internet Freight Cost] /
[Measures].[Internet Order Quantity]
, FORMAT_STRING = '$0.00', SOLVE_ORDER = 0

MEMBER [Date].[Fiscal].[Q3 to Q4Growth] AS
[Date].[Fiscal].[Fiscal Quarter].[Q4 FY 2004] -
[Date].[Fiscal].[Fiscal Quarter].[Q3 FY 2004] , SOLVE_ORDER = 10

SELECT
{[Internet Sales Amount],[Internet Freight Cost],
[FreightPerUnit] } ON COLUMNS,
{[Date].[Fiscal].[Fiscal Quarter].[Q3 FY 2004],
[Date].[Fiscal].[Fiscal Quarter].[Q4 FY 2004],
[Date].[Fiscal].[Q3 to Q4Growth] } ON ROWS
FROM [Adventure Works]

Figure 11-17 The result set after applying SOLVE_ORDER to the calculated members

When you create calculated members on both the row and column axes, and one depends
on the other, you need to tell MDX in what order to perform the calculations. In our experi-
ence, it’s quite easy to get this wrong, so we caution you to verify the results of your SOLVE_
ORDER keyword.

Note For more on solve orders, see the MSDN topic “Understanding Pass Order and Solve
Order (MDX)” at http://msdn.microsoft.com/en-us/library/ms145539.aspx.

Creating Key Performance Indicators


You can create key performance indicators (KPIs) in SSAS cubes by writing MDX code on the
KPIs tab, as you’ve seen in Chapter 8, and then you can use those KPIs in client tools such as
Microsoft Office Excel 2007 or PerformancePoint Server 2007. Because MDX code is the basis
for KPIs, let’s take a look at a basic KPI. Here you’ll use calculated members as part of your
KPI definition.

The Adventure Works database tracks a measure called Total Product Cost. Suppose you
want to evaluate the trend in Total Product Cost. First you need a calculated member that
determines Total Product Cost for the previous period based on the current period. Figure
11-18 shows a calculated member that slices Total Product Cost to the previous member of
the Date.Fiscal dimension hierarchy based on the current Date member selection.

Figure 11-18 Calculated member to determine the product costs for the previous period

Next you have a second calculated member (shown in Figure 11-19) that determines the
percent of Total Product Cost increase from the previous period (which could be last month,
last quarter, and so on) to the current period. Note that you’re evaluating the denominator
before performing any division to avoid any divide-by-zero exceptions.

Finally, on the KPIs tab (shown in Figure 11-20), you can create the KPI that evaluates the
percent of change in product cost. For this KPI, let’s start with the basics. If the product cost
has increased by only 5 percent or less from the prior month, quarter, or year, you’ll display a
green light, which means you consider a 5 percent or less increase to be good. If the cost has
increased by anywhere above 5 percent but less than 10 percent, you’ll display a yellow light,
which means you consider that marginal. If the cost has increased by more than 10 percent,
you consider that bad and will show a red light.

Figure 11-19 Second calculated member to determine the percent of increase (uses the first calculated
member)
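
Although the exact expressions appear in Figures 11-18 and 11-19, the two calculated members
might look roughly like the following script. This is a hedged reconstruction based on the
description above (the measure names follow the Adventure Works sample, and the IIf guard
mirrors the note about evaluating the denominator before dividing):

-- Product cost for the previous member of the Date.Fiscal hierarchy
CREATE MEMBER CURRENTCUBE.[Measures].[ProductCostPriorFiscalPeriod] AS
([Measures].[Total Product Cost], [Date].[Fiscal].PrevMember);

-- Percent increase from the prior period, guarding against an empty or zero denominator
CREATE MEMBER CURRENTCUBE.[Measures].[ProductCostPctIncrease] AS
IIf( IsEmpty([Measures].[ProductCostPriorFiscalPeriod]) OR
[Measures].[ProductCostPriorFiscalPeriod] = 0,
NULL,
([Measures].[Total Product Cost] -
[Measures].[ProductCostPriorFiscalPeriod]) /
[Measures].[ProductCostPriorFiscalPeriod]),
FORMAT_STRING = 'Percent';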

Note You’ll recall from our discussion of KPI creation in Chapter 8 that the actual values that
the KPI returns are as follows: 1 for exceeded value, 0 for at value, or –1 for below value. The KPI
designer in BIDS allows you to select an icon set—such as green, yellow, or red traffic lights—to
represent these values visually.

Figure 11-20 The BIDS interface to create a KPI



So, as a rule, a KPI contains (at least) the following:

■■ The name of the KPI.


■■ The associated measure group.
■■ The value associated with the KPI that the KPI will evaluate. (This value should be an
existing calculated member, not an in-line expression, which we’ll cover later in our dis-
cussion of KPI tips.)
■■ The goal (which could be a static number, calculated member, dimension property, or
measure).
■■ The status indicator (traffic light, gauge, and so on).
■■ The possible values for the status indicator. (If we’ve met our goal, return a value of 1
for green, and so on.)

After deploying the changes, you can test the KPI in SSMS with the following code (as shown
in Figure 11-21):

SELECT { [Measures].[Internet Sales Amount],
[Measures].[ProductCostPriorFiscalPeriod],
[Measures].[ProductCostPctIncrease],
KPIValue("KPIProductCostPctIncrease"),
KPIStatus("KPIProductCostPctIncrease") } ON COLUMNS,
Order(
Filter([Product].[Product].Children,[Internet Sales Amount] > 0),
[ProductCostPctIncrease],BDESC) ON ROWS
FROM [Adventure Works]
WHERE [Date].[Q3 FY 2004]

Figure 11-21 Testing the KPI results in SSMS with a test MDX query

Creating KPIs Programmatically


New in SQL Server 2008 is the ability to create KPIs programmatically. This is a welcome
enhancement for BI/database developers who preferred to script out KPI statements instead
of designing them visually. This is accomplished by the addition of the CREATE KPI statement.
As with the CREATE MEMBER statement, running the CREATE KPI statement from a query
tool, such as SSMS, creates these KPIs for the duration of that query session only.

There is also a new DROP KPI statement, which allows you to programmatically delete KPIs.
The KPI script capability in SQL Server 2008 allows you to write the same statements that you
placed in the designer back in Figure 11-20. The following code sample shows an example of
how you’d script out the same KPI definitions (for example, goal statements, status expres-
sions, and so on) you saw in Figure 11-20:

CREATE KPI [Adventure Works].[KPIProductCostPctIncrease]
AS [Measures].[ProductCostPctIncrease]
, GOAL = .05
, STATUS = CASE
WHEN KPIValue("KPIProductCostPctIncrease") <= KPIGoal("KPIProductCostPctIncrease")
THEN 1
WHEN KPIValue("KPIProductCostPctIncrease") <= KPIGoal("KPIProductCostPctIncrease") * 2
THEN 0
ELSE -1 END
, STATUS_GRAPHIC = 'Traffic Light'
, CAPTION = 'Product Cost Pct Increase'
, DISPLAY_FOLDER = 'KPIs';
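
If you want to remove the session-scoped KPI when you are finished testing, the corresponding
DROP KPI statement looks something like the following (adjust the cube and KPI names for your
environment):

DROP KPI [Adventure Works].[KPIProductCostPctIncrease];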

Tip For more on creating KPIs programmatically, you can check out the following link:
http://msdn.microsoft.com/en-us/library/bb510608.aspx.

Additional Tips on KPIs


Here are a few tips for creating and testing KPIs:

■■ Some developers place calculation expressions in the KPI value expression. Although
this works, it also couples the calculated member to the KPI. In some reports, you
might want to use the calculated expression without actually displaying the KPI. So a
more manageable approach is to build the calculated expression as a separate calcu-
lated member. From there, you can refer to the calculated member by name when you
define your KPI value, and you can also use the calculated member independently of
the KPI.
■■ Create KPIs in your OLAP cubes rather than in client environments such as Excel, Office
SharePoint Server 2007, or PerformancePoint Server. Although you can create KPIs
using client tools, we prefer to create KPIs centrally because creating them outside of
the SSAS OLAP cube creates a potential maintenance issue.
■■ The most effective way to test KPIs is outside of the BIDS environment. One way is to
write some MDX code in SSMS, as you saw back in Figure 11-21. Another way is to test
the KPIs in Excel.

Note There is an interesting project on CodePlex that showcases the use of programmatic
KPI creation. It is called the Analysis Services Personalization Extension. This CodePlex project
allows you to add calculations to a cube without requiring you to update and deploy an Analysis
Services project. In addition, you can customize calculations for specific users. Download the
sample application at http://www.codeplex.com/MSFTASProdSamples.

Using MDX with SSRS and PerformancePoint Server


MDX is an important part of any serious data warehousing application. However, in the world
of fancy dashboards and reports, MDX is only as valuable as the reporting tools that support
it. Reporting Services and PerformancePoint Server are two popular tools for producing end-
user output in a data warehousing environment, and both support the incorporation of MDX
to produce a truly flexible reporting experience. At this point, let’s take a quick look at how
you can use MDX with the two tools.

Using MDX with SSRS 2008


SSRS 2008, like its predecessor (SSRS 2005), allows report authors to create reports against
SSAS OLAP cubes. In some instances, you can use the built-in graphical query tool to design
reports without writing MDX code. However, you might often have to override the query
designer and write your own custom MDX code if you want to use features of the MDX lan-
guage that the query designer doesn’t support. One example of this, which you’ll see in the
next couple of paragraphs, is the use of named sets. There are, of course, other features of
MDX that aren’t supported directly in the SSRS visual MDX query designer. As we’ll show in
Chapter 21, “Building Reports for SQL Server 2008 Reporting Services,” you can switch the
SSRS query designer in BIDS from visual to manual mode by clicking the design mode but-
ton on the embedded toolbar. In manual mode, you can simply type any MDX code that you
want to use.

Let’s take a look at an example where we’ll leverage the dynamic named set and ranking
function from the “About Named Sets” section of this chapter. Figure 11-22 shows our result,
a basic but functional SSRS report that shows the top 10 products based on geography and
date selection.

To get started, you can write the code in Figure 11-23, which hard-codes the customer and
date selection into the WHERE clause. (In the next step, you’ll change those to use query
parameters.)

Figure 11-22 Sample output of an MDX query in SSRS

Figure 11-23 The MDX query editor in SSRS 2008

Next, you click the MDX query parameters button on the toolbar (the fifth button from the
right), which is shown in Figure 11-23. SSRS 2008 then displays the Query Parameters dialog
box, where you can define two queries, for the Customer parameter and Date parameter, and
their corresponding dimension and hierarchy attribute settings. (See Figure 11-24.)

Figure 11-24 Defining MDX query parameters in SSRS 2008

After you define the query parameters, you can modify the query shown in Figure 11-23 to
reference the query parameters. Note that SSRS uses the StrToSet function to convert the
parameters from a string value to an actual set. The following code sample shows the MDX
that is created. Note the WHERE clause, which uses the StrToSet function to convert the
named parameter values to sets. The result is shown in Figure 11-25.

SELECT {[Measures].[Internet Gross Profit],
[Measures].[ProductProfitRank]} ON COLUMNS,
[Top10ProductsByProfit] ON ROWS
FROM [Adventure Works]
WHERE ( StrToSet(@CustomerParm), StrToSet(@DateParm))

Figure 11-25 Using StrToSet and MDX parameters in SSRS 2008



Using MDX with PerformancePoint Server 2007


Although PerformancePoint Server 2007 contains many powerful built-in features, you’ll
often need to incorporate bits and pieces of MDX to build flexible output. We want to take
a few moments and provide a quick walkthrough of a PerformancePoint Server chart that
uses MDX.

Once again, we’ll show the result first. Figure 11-26 shows a line chart that plots the last 12
months of available data for a user-defined geography and product combination. Although
a user could normally build this chart in PerformancePoint Server 2007 without using MDX,
the chart also has a requirement to plot monthly sales for all siblings of the product selection.

In Figure 11-26, the chart plots sales for Mountain Bikes, Road Bikes, and Touring Bikes when
the user selects Mountain Bikes (because all three belong to the same parent, Bikes). For
this, you need to use the MDX Siblings function, which the PerformancePoint Server 2007
designer doesn’t really support. So you need to write some custom MDX.

Figure 11-26 The desired output in PerformancePoint Server 2007—a line chart that shows monthly sales for
a selected product and its siblings

The PerformancePoint Server 2007 designer allows you to override the graphical designer
and write your own custom MDX. Your MDX code will reference an existing named set called
[Last12Months] and also account for user-defined parameters for geography and prod-
uct. The named set is simply a convenience to make your code more readable. It consists
of the last 12 month–level members of the time hierarchy and is defined using the syntax
that we covered at the beginning of this chapter in “Querying Dimension Properties.” In
PerformancePoint Server 2007, you reference parameters with << and >> tokens, as you can
see in the following code sample:

SELECT [Internet Sales Amount] *
<<ProdFilter>>.Siblings ON ROWS,
[Last12Months] ON COLUMNS
FROM [Adventure Works]
WHERE (<<GeoFilter>> )

This code is entered into the MDX query editor, shown in Figure 11-27.

Figure 11-27 The MDX editor in PerformancePoint Server 2007, with the ability to code parameters using
<<parm>> tokens

At the bottom of the MDX code entry page, you can define GeoFilter and ProdFilter with
default values, as shown in Figure 11-28.

Figure 11-28 Defining MDX parameters in PerformancePoint Server 2007

The next step is to build filter sections in the dashboard page. In Figure 11-29, you define
two filters, for GeographyDownToState and ProductsDownToSubcategory, so that the user can
select from subset lists that only go down as far as State and SubCategory. As with the named
set called Last12Months that we used in the earlier examples, both GeographyDownToState
and ProductsDownToSubcategory are named sets that we’ve created to improve the readabil-
ity of the code that will use these values.

Figure 11-29 Building filter definitions in PerformancePoint Server 2007

The final major step is to add all three components (the two filters and the chart) onto a
dashboard page. (See Figure 11-30.) In addition, you also need to add filter links between the
two filters and the chart parameters (not shown here). You do this by creating filter defini-
tions and the links between the filter and the chart in PerformancePoint Server using the
Dashboard Designer.

Figure 11-30 Building a sample dashboard in PerformancePoint Server 2007, with two filter definitions and
the chart

Summary
In this chapter, you took a deeper look at working with MDX in the SSAS environment. In
this chapter (and the previous one), we not only covered advanced functions such as IIf and
ParallelPeriod, but also explored concepts such as scripting KPIs and named sets. We also
looked at SOLVE_ORDER, and closed the chapter with an introduction to MDX syntax in SSRS
and PerformancePoint Server. This included the use of other features, such as the StrToSet
function and MDX parameters.
Chapter 12
Understanding Data Mining
Structures
We have completed our tour of Microsoft SQL Server Analysis Services OLAP cubes and
dimension design, development, refinement, processing, building, and deploying, but we still
have much more to do in Analysis Services and the Business Intelligence Development Studio
(BIDS). In this chapter and in Chapter 13, “Implementing Data Mining Structures,” we explore
the rich world of Analysis Services data mining structures. In this chapter, we review the busi-
ness situations that warrant the use of Analysis Services data mining models and explain how
to determine which specific data mining algorithms work best for various business needs. We
continue to use BIDS as our development environment—for design, development, tuning,
and so on. We have a lot of information to cover, so let’s get started!

Reviewing Business Scenarios


As we’ve discussed, you can think of the data mining functionality included in SSAS as a set
of tools to give your end users the ability to discover patterns and trends based on defined
subsets of your data. The source data can be relational or multidimensional. You can simply
review the results of applying data mining algorithms to your data and use those results as
a basis for making business decisions. You can also use the results as a basis for processing
new data. Microsoft often called the data mining functionality available in SSAS predictive
analytics because this set of tools is seen as a way for a business to proactively understand
its data. An example of this would be to design (or refine) a marketing strategy based on the
results of data clustering. Alternatively, you can use the result of this analysis to help you
to predict future values based on feeding new data into the validated model. Data mining
is meant to be complementary to an SSAS cube. A cube is often used to verify results—in
other words, to answer the question “We think this happened, does the data support our
belief?” Mining structures are used to discover correlations, patterns, and other surprises in
the data—in other words, “What will happen?” Another common use of mining is when busi-
nesses buy competitive data; mining can be used to help businesses answer questions like
“What if we got into this type of new business?” and “What if we started doing business in
these locations?”

In SQL Server 2008, Microsoft continues to focus on making data mining models easier for
you to implement and the results easier for your users to understand. Data mining can be
one of the most challenging types of data analysis solutions to put into operation because
of the need to deeply understand the various algorithms involved. Traditionally, data mining
products were used only by companies that had substantial resources: The specialized data
mining products were expensive, and consultants had to be hired to implement the com-
plex algorithms included in those products. It was not uncommon for those working in the
data mining industry to have advanced degrees in mathematics, particularly in the area of
statistics.

The general product goal for SSAS—BI for everyone—is extended to data mining. In fact,
Microsoft usually refers to data mining as predictive analytics because it believes that this term
more properly describes the accessibility and usage of the data mining toolset in SQL Server
2008.

The tools provided in BIDS make creating mining structures easy. As with OLAP cubes, data
mining structures are created via wizards in BIDS. Tools to help you verify the accuracy of
your particular mining model and to select the most appropriate algorithms are also avail-
able. Your users also benefit, by having meaningful results presented in a variety of ways.
Both BIDS and SSMS include many data mining model viewers to choose from, and you can
tailor data mining results to the appropriate audience. The client integration in Microsoft
Office Excel that was introduced in SQL Server 2005 has been significantly enhanced for SQL
Server 2008. An API is also included so that you can do custom development and integration
into any type of user application, such as Windows Forms, Web Forms, and so on.

Note The model viewers in BIDS or SSMS are not intended to be used by end users. They are
provided for you so that you can better understand the results of the various mining models
included in your mining structure. You may remember that these viewers are available in Excel
2007 with the SQL Server 2008 Data Mining Add-ins for Office installed. Of course, Excel 2007 is
often used as an end-user client tool. So, although the viewers in BIDS or SSMS aren’t meant to
be accessed by end users from BIDS, these same viewers are often used from within Excel 2007
by end users. This allows you, the developer, to have a nearly identical UI (from within BIDS or
SSMS) to that of your end users. These viewers are also available as embeddable controls for
developers to include in custom end-user applications.

This version of SSAS has tremendously enhanced the available methods for using data min-
ing. These methods are expressed as algorithms—nine algorithms are included in SSAS 2008,
and we discuss them in detail later in this chapter. Although some enhancements have been
made to tuning capabilities and performance for SQL Server 2008, these algorithms provide
nearly the same functionality as they did in SQL Server 2005.

One of the most challenging aspects of data mining in SSAS is understanding what the vari-
ous algorithms actually do and then creating a mining structure that includes the appropri-
ate algorithm or algorithms to best support your particular business requirements. Another
important consideration is how you will present this information to the end users. We believe
that these two concerns have seriously reduced implementation of data mining solutions
as part of business intelligence solutions. We find that neither developers nor end users can
visualize the potential benefits of SQL Server 2008 data mining technologies if developers
can’t provide both groups with reference samples.

For you to be able to build such samples, you’ll have to first think about business challenges
that data mining technologies can impact. The following list is a sample of considerations
we’ve encountered in our work designing business intelligence solutions for customers:

■■ What characteristics do our customers share? How could we group them, or put the
types of customers into buckets? This type of information could be used, for example,
to improve effectiveness of marketing campaigns by targeting different campaign
types more appropriately, such as using magazine ads for customers who read maga-
zines, TV ads for customers who watch TV, and so on.
■■ What situations are abnormal for various groups? This type of analysis is sometimes
used for fraud detection. For example, purchasing behavior outside of normal locations,
stores, or total amounts might be indicative of fraud for particular customer groups.
■■ What products or services should be marketed or displayed next to what other prod-
ucts or services? This is sometimes called market-basket analysis and can be used in
scenarios such as deciding which products should be next to each other on brick-and-
mortar store shelves, or for Web marketing, deciding which ads should be placed on
which product pages.
■■ What will a certain value be (such as rate of sales per week) for an item or set of items
at some point in the future, based on some values (such as the price of the item) that
the item had in the past? An example of this would be a retailer that adjusts the price of
a key item upward or downward based on sell-through rate for that price point for that
type of item for particular groups of stores, thereby controlling the amount of inven-
tory in each store of that particular item over time.

As we dive deeper into the world of data mining, we’ll again use the sample Adventure Works
DW 2008 data, which is available for download from CodePlex at http://www.codeplex.com/
MSFTDBProdSamples/Release/ProjectReleases.aspx?ReleaseId=16040. To do this, open the
same Adventure Works solution file that we’ve been using throughout this book in BIDS. The
sample contains both OLAP cubes and data mining structures. When working with this sam-
ple, you can work in interactive or disconnected mode in BIDS when designing data mining
models, just as we saw when working with OLAP cubes. We’ll start by working in interactive
(connected) mode. You’ll note that the sample includes five data mining structures. We’ll use
these for the basis of our data mining discussion in this chapter.

Figure 12-1 (shown in disconnected mode) shows the sample data mining containers, called
mining structures, in Solution Explorer in BIDS. Each mining structure contains one or more
data mining models. Each mining model is based on a particular algorithm. As we drill in,
we’ll understand which business situations the selected algorithms are designed to impact.

Figure 12-1 The sample Adventure Works cube contains five different mining structures.

Categories of Data Mining Algorithms


You create a data mining structure in BIDS by using the Data Mining Wizard. As with OLAP
cubes, when you create a new data mining structure, you must first define a data source and
a data source view to be used as a basis for the creation of the new data mining structure.
Data mining structures contain one or more data mining models. Each data mining model
uses one of the nine included data mining algorithms.

It is important that you understand what capabilities are included in these algorithms. Before
we start exploring the individual algorithms, we’ll first discuss general categories of data min-
ing algorithms: classification, clustering, association, forecasting and regression, sequence
analysis and prediction, and deviation analysis. This discussion will focus on the types of busi-
ness problems data mining algorithms are designed to impact. Next we’ll discuss which SSAS
data mining algorithms are available in which category or categories.

Classification
With classification, the value of one or more fixed variables is predicted based on multiple input
variables (or attributes). These types of algorithms are often used when a business has a large
volume of high-quality historical data. The included algorithm most often used to imple-
ment this technique is Microsoft Decision Trees. The Microsoft Naïve Bayes and the Neural
Network algorithms can also be used. The Naïve Bayes algorithm is so named because it
assumes all input columns are completely independent (or equally weighted). The Neural
Network algorithm is often used with very large volumes of data that have very complex
relationships. With this type of source data, Neural Network will often produce the most
meaningful results of all of the possible algorithms.

Clustering
In clustering, source data is grouped into categories (sometimes called segments or buckets)
based on a set of supplied values (or attributes). All attributes are given equal weight when
determining the buckets. These types of algorithms are often used as a starting point to help
end users better understand the relationships between attributes in a large volume of data.
Businesses also use algorithms that create grouping of attributes, such as clustering-type
algorithms, to make more intelligent, like-for-like predictions: If this store is like that store in
these categories, it should perform similarly in this category. The included algorithm most
often used to implement this technique is the Microsoft Clustering algorithm.

Association
Finding correlations between variables in a set of data is called association, or market-basket
analysis. The goal of the algorithm is to find sets of items that show correlations (usually
based on rates of sale). Association is used to help businesses improve results related to
cross-selling. In brick-and-mortar locations, the results can be used to determine shelf place-
ment of products. For virtual businesses, the results can be used to improve click-through
rates for advertising. The included algorithm most often used to implement this technique is
the Microsoft Association algorithm.

Forecasting and Regression


Like classification, forecasting and regression predict a value based on
multiple input variables. The difference is that the predictable value is a continuous number.
In forecasting, the input values usually contain data that is ordered by time. This is called a
time series. Businesses use regression algorithms to predict the rate of sale of an item based
on retail price, position in store, and so on, or to predict amount of rainfall based on humid-
ity, air pressure, and temperature. The included algorithm most often used to implement this
technique is the Microsoft Time Series algorithm. The Linear and Logistic Regression algorithms can
also be used.

Sequence Analysis and Prediction


Sequence analysis and prediction find patterns in a particular subset of data. Businesses can
use this type of algorithm to analyze the click-path of users through a commercial Web site.
These paths or sequences are often analyzed over time—for example, what items did the
customer buy on the first visit? What did the customer buy on the second visit? Sequence
and association algorithms both work with instances (called cases in the language of data
mining) that contain a set of items or states. The difference is that only sequence algorithms
analyze the state transitions (the order or time sequence in which the cases occurred). Association algo-
rithms consider all cases to be equal. The included algorithms most often used to implement
this technique are Microsoft Sequence Clustering or Time Series.

Deviation Analysis
Deviation analysis involves finding exceptional cases in the data. In data mining (and in other
areas, such as statistics), such cases are often called outliers. Businesses use this type of algo-
rithm to detect potential fraud. One example is credit card companies who use this technique
to initiate alerts (which usually result in a phone call to the end user, asking him or her to
verify a particularly unusual purchase based on location, amount, and so on). The most com-
monly used included algorithms for this type of analysis are Microsoft Decision Trees used in
combination with one or more other algorithms (often Microsoft Clustering).

Working in the BIDS Data Mining Interface


We’ll use the AdventureWorks BI sample as a basis for understanding the BIDS interface for
data mining. The sample includes five data mining structures. Each structure includes one or
more data mining models. Each model is based on one of the included data mining algo-
rithms. As with the OLAP designer, you’ll right-click folders in the Solution Explorer window
to open wizards to create new data mining objects. One difference between working with
OLAP objects and data mining objects in BIDS is that for the latter you’ll use the Properties
window more frequently to perform tuning after you’ve initially created data mining objects.
Figure 12-2 shows the BIDS data mining structure design interface. Note the five tabs in the
designer: Mining Structure, Mining Models, Mining Model Viewer, Mining Accuracy Chart,
and Mining Model Prediction. The Properties window in the figure is highlighted.

Figure 12-2 The BIDS designer for data mining structures



Tip If you’re completely new to data mining, you might want to skip to Chapter 23, “Using
Microsoft Excel 2007 as an OLAP Cube Client,” and read the explanation of the SQL Server 2008
Data Mining Add-ins for Excel 2007. Specifically, you’ll be interested to know that in addition to
being a client interface for Analysis Services data mining, the Data Mining tab on the Excel 2007
Ribbon is also designed to be a simpler, alternative administrative tool to create, edit, and query
data mining models. We have found that using Excel first provides a more accessible entry into
the capabilities of SSAS data mining, even for developers and administrators.

In Figure 12-2 the Properties window shows some values in the Data Type section that are
probably new to you, such as Content, DiscretizationBucketCount, and so on. We’ll be explor-
ing these values in greater detail in the next section.

Understanding Data Types and Content Types


Analysis Services data mining structures use data and content types specific to the Microsoft
implementation of data mining. You need to understand these types when you build your
mining structures. Also, certain algorithms support only certain content types. We’ll start by
explaining the general concepts of content and data type assignments and then, as we work
our way through more detailed explanations of each algorithm, we’ll discuss requirements for
using specific types with specific algorithms.

A data type is a data mining type assignment. Possible values are Text, Long, Boolean, Double,
or Date. Data types are detected and assigned automatically when you create a data mining
structure.

A content type is an additional attribute that a mining model algorithm uses to understand
the behavior of the data. For example, marking a source column as a Cyclical content type
tells the mining algorithm that the order of the data is particular, important, and repetitive,
or has a cycle to it. One example is the month numbers of more than one year in a time
table.

The rule of thumb is to determine the data type first, then to verify (and sometimes adjust)
the appropriate content type in your model. Remember that certain algorithms support
certain content types only. For example, Naïve Bayes does not support Continuous content
types. The Data Mining Wizard detects content types when it creates the mining structure.
The following list describes the content types and the data types that you can use with the
particular content types.

■■ Discrete The column contains distinct values—for example, a specific number of chil-
dren. It does not contain fractional values. Also, marking a column as Discrete does not
indicate that the order (or sequence) of the information is important. You can use any
data type with this content type.
■■ Continuous The column has values that are a set of numbers representing some unit
of measurement. These values can be fractional. An example of this would be an out-
standing loan amount. You can use the date, double, or long data type with this con-
tent type.
■■ Discretized The column has continuous values that are grouped into buckets.
Each bucket is considered to have a specific order and to contain discrete values.
You saw an example of this in Figure 12-2 using the Age column in the Targeted
Mining sample. Note that you’ll also set the DiscretizationMethod and (optionally) the
DiscretizationBucketCount properties if you mark your column as Discretized. In our
sample, we’ve set the bucket size to 10 and DiscretizationMethod to Automatic. Possible
values for discretization method are automatic, equal areas, or clusters. Automatic
means that SSAS determines which method to use. Equal areas results in the input data
being divided into partitions of equal size. This method works best with data with regu-
larly distributed values. Clusters means that SSAS samples the data to produce a result
that accounts for “clumps” of data values. Because of this sampling, Clusters can be
used only with numeric input columns. You can use the date, double, long, or text data
type with the Discretized content type.
■■ Table The column contains a nested table that has one or more columns and one or
more rows. These columns can contain multiple values, but at least one of these values
must be related to the parent case record. An example would be individual cus-
tomer information in a case table, with related customer purchase item information in a
nested transaction table.
■■ Key The column is used as a unique identifier for a row. You can use the date, double,
long, or text data type for this.
■■ Key Sequence The column is a type of a key—the sequence of key values is important
to your model. You can use the double, long, text, or date data type with this content
type.
■■ Key Time The Key Time column, similar to Key Sequence, is a type of key where the
sequence of values is important. Additionally, by marking your column with this content
type, you are indicating to your mining model that the key values run on a time scale.
You can use the double, long, or date data type with this content type.
■■ Ordered The column contains data in a specific order that is important for your min-
ing model. Also, when you mark a column with the Ordered content type, SSAS consid-
ers that all data contained is discrete. You can use any data type with this content type.
■■ Cyclical The column has data that is ordered and represents a set that cycles (or
repeats). This is often used with time values (months of the year, for example). Data
marked as Cyclical is considered both ordered and discrete. You can use any data type
with this content type.

Note There is also a designator named Classified for columns. This refers to the ability to
include information in a column that describes a different column in that same model. We rarely
use this feature because the standard SSAS data mining algorithms don’t support it.

Table 12-1 lists the data types and the content types they support. Keep in mind that under-
standing the concept of assigning appropriate content and data types is critical to successful
model building.

Table 12-1 Data Types and Content Types

Data Type   Content Types Supported
Text        Discrete, Discretized, Sequence
Long        Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered (by sequence or by time)
Boolean     Discrete
Double      Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered (by sequence or by time)
Date        Continuous, Discrete, Discretized, Key Time

You can specify both the data type and the content type by using the Data Mining Wizard
or by configuring the Properties window in BIDS. Note that to use the Key Time and Key
Sequence content types you have to install additional algorithms—these content types are
not supported by the algorithms included with SSAS.
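
To see how these assignments look outside the Properties window, the following DMX sketch declares a small model with explicit data types and content types. The model and column names are loosely based on the Targeted Mailing scenario but are illustrative assumptions, not the shipped sample definition:

CREATE MINING MODEL [TM Content Types Demo]
(
    [Customer Key]      LONG   KEY,
    [Gender]            TEXT   DISCRETE,
    [Number Cars Owned] LONG   DISCRETE,
    [Age]               LONG   DISCRETIZED(EQUAL_AREAS, 10),
    [Yearly Income]     DOUBLE CONTINUOUS,
    [Bike Buyer]        LONG   DISCRETE PREDICT
)
USING Microsoft_Decision_Trees

Each column pairs a data type with a content type from Table 12-1; the PREDICT flag marks [Bike Buyer] as predictable, and the DISCRETIZED arguments correspond to the DiscretizationMethod and DiscretizationBucketCount properties discussed earlier.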

Setting Advanced Data Properties


In addition to data and content types, you may wish to specify a few other properties so that
the algorithm you select understands the source data better. This understanding improves
the results produced by the algorithm. These properties are as follows:

■■ Modeling Flags These vary by algorithm but usually include the NOT NULL flag at a
minimum.
■■ Relationship (between attributes) This is available only by using the DMX clause
Related To between two attribute columns. It is used to indicate natural hierarchies and
can be used with nested tables (defined later in this chapter).
■■ Distribution Normal indicates that the source data distribution resembles a bell-
shaped histogram. Uniform indicates that the source data distribution resembles a flat
curve where all values are equally likely. Log Normal indicates that the source data
distribution is elongated at the upper end only. This attribute configuration is optional;
you generally use this only when you are using source data that is dirty or does not
represent what you expect, such as feeding data that really should match a (normal)
bell curve, but doesn’t.

In BIDS, the first tab in the mining structure designer is named Mining Structure. Here you
can see the source data included in your mining model (which is, in fact, a DSV). As with the
OLAP cube designer in BIDS, in this work area you can only view the source data from the
DSV—you cannot change it in any way, such as by renaming columns, adding calculated
columns, and so on. You have to use the original DSV to make any structural changes to the
DSV. In this view you are only allowed to view the source data, add or remove columns, or
add nested tables to the mining structure.

You can use either relational tables or multidimensional cubes as source data for Analysis
Services data mining structures. If you choose relational data, that source data can be
retrieved from one or more relational tables, each with a primary key. Optionally the source
data can include a nested table. An example of this would be customers and their orders:
The Customers table would be the first table and Orders would be a nested table. This situation
would require a primary key/foreign key relationship between the rows in the two tables as
well. One way to add a nested table to a data mining structure is to right-click the Object
Browser tree on the Mining Structure tab, and then click Add A Nested Table. You are then
presented with the dialog box shown in Figure 12-3. Select the table you wish to nest from
the DSV. Note that you can filter the columns in the nested table by data type.

Tip It is very important that you understand and model nested tables correctly (if you are using
them). For more information read the SQL Server Books Online topic “Nested Tables (Analysis
Services—Data Mining).”

Figure 12-3 Nested tables in a mining model structure
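
The same customers-and-orders shape can also be declared in DMX. The following sketch is illustrative only—the object names are hypothetical, and the nested [Orders] table simply stands in for the related transaction table described above:

CREATE MINING MODEL [Customer Orders Demo]
(
    [Customer Key] LONG KEY,
    [Orders] TABLE PREDICT        // nested table of related purchase rows
    (
        [Product Name] TEXT KEY   // nested key relates each row to its parent case
    )
)
USING Microsoft_Association_Rules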

You can configure several properties for the data mining structure in this view. An example
is the CacheMode property. Your choices are KeepTrainingCases or ClearAfterProcessing. The
latter option is often used during the early development phase of a mining project. You may
process an individual mining model only to find that the data used needs further cleaning. In
this case, you’d perform the subsequent cleaning, and then reprocess that model.

Alternatively, you can perform a full process on the entire mining structure. If you do this, all
mining models that you have defined inside of the selected mining structure are processed.
As with OLAP cube processing, data mining processing includes both processing of metadata
and data itself. Metadata is expressed as XMLA; data is retrieved from source systems and
loaded into destination data mining structures using DMX queries. Chapter 13 includes more
detail on this process.
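
One way to train (process) a structure directly with DMX is an INSERT INTO statement that binds structure columns to a relational query. The following is a hedged sketch—the data source name, view, and column list are assumptions chosen to echo the Targeted Mailing example—and Chapter 13 walks through the real processing options:

INSERT INTO MINING STRUCTURE [Targeted Mailing]
(
    [Customer Key], [Gender], [Number Cars Owned], [Yearly Income], [Bike Buyer]
)
OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, Gender, NumberCarsOwned, YearlyIncome, BikeBuyer
     FROM dbo.vTargetMail')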

Choosing a Data Mining Model


The next tab in the mining structure designer in BIDS is the Mining Models tab. Here you
view the mining model(s) that you’ve included in the selected mining structure. You can eas-
ily add new models to your structure by right-clicking the designer surface and then clicking
New Mining Model. You can also change the source data associated with a type of min-
ing model by creating more than one instance of that model and “ignoring” one or more
columns from the mining structure DSV. Ignoring a column of data in a particular mining
model is shown (for the Yearly Income column) using the Microsoft Naïve Bayes algorithm in
Figure 12-4.

Figure 12-4 The Mining Models tab in BIDS allows you to specify the usage of each source column for each
mining model.

You can change the use of the associated (non-key) source columns in the following ways:

■■ Ignore This setting causes the model to remove the column from the model.
■■ Input This setting causes the model to use that column as source data for the model.
■■ Predict This setting causes the model to use that column as both input and output.
■■ PredictOnly This setting causes the model to use that column as output only.

Note The specific requirements for input and predictable columns for each type of algorithm
are discussed in Chapter 13.

Nested tables are another point to consider when you are deciding how to mark columns
for use by data mining algorithms. If your source model includes a nested table and you’ve
marked that table as Predict (or PredictOnly), all of its nested attributes are automatically
marked as predictable. For this reason, you should include only a small number of attributes
in a nested table marked as predictable.

As we continue our tour of Analysis Services data mining, it is interesting to note that the
Mining Models tab is designed to support a key development activity—that is, building
multiple models using the same source data. You may wonder why you’d want to do that.
Because we haven’t explored the algorithms in detail yet, this capability will probably be a bit
mysterious to you at this point. Suffice it to say that this ability to tinker and adjust by add-
ing, tuning, or removing mining models is a key part of using SSAS data mining successfully.
Predictive analytics is not an exact science; it is more of an art. You apply what you think will
be the most useful algorithms to get the best results. Particularly at the beginning of your
project, you can expect to perform a high number of iterations to get this tuning just right.
You will inevitably try adjusting the number of input data columns, the algorithm used, the
algorithm parameters, and so on so that you can produce useful results. You’ll also test and
score each model as you proceed; in fact, SSAS includes tools to do this so that you can
understand the usefulness of the various mining model results.

We will return to this topic later in this chapter after we review the capabilities of the
included algorithms.

Filtering is a new capability in SQL Server 2008 data mining models. You can build mining
models on filtered subsets of the source data without having to create multiple mining struc-
tures. You can create complex filters on both cases and nested tables using BIDS. To imple-
ment filtering, right-click the model name on the BIDS Mining Models tab and then click Set
Model Filter. You are then presented with a blank model filter dialog box, where you can
configure the desired filter values. We show a sample in Figure 12-5.

Another enhancement to model building included in SQL Server 2008 is the ability to alias
model column names. This capability allows for shorter object names. You can implement
it using BIDS or with the ALTER MINING STRUCTURE (DMX) syntax. To use this new syntax,
you must have first created a mining structure. Then you can add another mining model (which
can include aliased column names) to your original structure. For detailed syntax, see
the SQL Server Books Online topic “ALTER MINING STRUCTURE (DMX)” at
http://msdn.microsoft.com/en-us/library/ms132066.aspx.

Figure 12-5 Mining models based on filtered subsets of source data
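
Both of these model-definition features—filters and column aliases—can also be expressed directly in DMX. The following is a minimal sketch only; the structure, model, column, and filter values shown are assumptions for illustration rather than objects guaranteed to exist in the sample database:

ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [TM Trees NA Only]
(
    [Customer Key],
    [Number Cars Owned] AS [Cars],   // column alias shortens the model column name
    [Bike Buyer] PREDICT
)
USING Microsoft_Decision_Trees
WITH FILTER ([Region] = 'North America')   // model trained on a filtered subset of cases

The WITH FILTER clause restricts the cases used to train this one model, while the AS clause gives the model column a shorter name than its structure column.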

You can configure the algorithm parameters for each mining model in the mining structure
by right-clicking the mining model on the designer surface and then clicking Set Algorithm
Parameters. The available parameters vary depending on which mining algorithm you are
working with and the edition of SQL Server 2008. Several advanced configuration properties
are available in the Enterprise edition of SQL Server 2008 only. Figure 12-6 shows the config-
urable parameters for the Microsoft Decision Trees model. Note that when you select one of
the properties, the configuration dialog box shows you a brief definition of the configurable
property value.

Figure 12-6 The Algorithm Parameters dialog box in BIDS, in which you can manually configure advanced
algorithm properties

As you become a more advanced user of data mining, for select algorithms, you may also
add your own custom parameters (and configure their values) via this dialog box. Of course
the available parameters vary for each included algorithm. You should document any
changes you make to default values and you should have a business justification for mak-
ing such changes. In many cases you will find that you don’t need to make any changes to
default values—in fact, making changes without a full understanding can result in decreased
performance and overall effectiveness of the selected algorithm.

Picking the Best Mining Model Viewer


The next tab in the mining structure designer in BIDS is the Mining Model Viewer. An inter-
esting aspect of this section is that each mining model algorithm includes one or more
types of mining model viewers. The purpose of the broad variety of viewers is to help you
to determine which mining model algorithms are most useful for your particular business
scenario. The viewers include both graphical and text (rows and columns) representations
of data. Some of the viewers include multiple types of graphical views of the output of the
mining model data. Additionally, some of the viewers include a mining legend shown in the
Properties window of the designer surface. Each algorithm has a collection of viewers that is
specific to that algorithm. These viewers usually present the information via charts or graphs.
In addition, the Microsoft Generic Content Tree Viewer is always available, providing very
detailed information about each node in the mining model.

Tip The mining model viewers are also available in SSMS. To access them, connect to SSAS in
SSMS, right-click the particular mining structure in the Object Explorer, and then click Browse.
The viewers are also available in Excel 2007 (after you’ve installed the SQL Server 2008 Data
Mining Add-Ins for Office 2007) via the Data Mining tab on the Ribbon. Excel is intended as an
end-user interface. Application developers can also download embeddable versions of these
viewers and incorporate them into custom applications. You can download the embeddable con-
trols at http://www.sqlserverdatamining.com/ssdm/Home/Downloads/tabid/60/Default.aspx.

The following list shows the nine data mining algorithms available in SQL Server 2008.
Although some of the algorithms have been enhanced, no new algorithms are introduced in
this product release. This list is just a preview; we cover each algorithm in detail later in the
chapter.

■■ Microsoft Association
■■ Microsoft Clustering
■■ Microsoft Decision Trees
■■ Microsoft Linear Regression
■■ Microsoft Logistic Regression
■■ Microsoft Naïve Bayes
■■ Microsoft Neural Network
■■ Microsoft Sequence Clustering
■■ Microsoft Time Series

Again using the Adventure Works DW 2008 sample, we now look at some of the viewers
included for each algorithm. Using the sample mining structure named Targeted Mailing, we
can take a look at four different viewers, because this structure includes four mining mod-
els, each based on a different mining model algorithm. After you open this structure in the
designer in BIDS and click the Mining Model Viewer tab, the first listed mining model, which
is based on the Microsoft Decision Trees algorithm, opens in its default viewer.

Note Each algorithm includes one or more viewer types. Each viewer type contains one or more
views. An example is the Microsoft Decision Trees algorithm, which ships with two
viewer types: Microsoft Tree Viewer and Microsoft Generic Content Tree Viewer. The Microsoft
Tree Viewer contains two views: Decision Tree and Dependency Network. The Microsoft Generic
Content Tree Viewer contains a single view of the same mining model, but in a different visual
format. Are you confused yet? This is why we prefer to start with the visuals!

In addition to the two types of views shown in the default viewer (Figure 12-7) of the
Microsoft Decision Trees algorithm, you can further customize the view by adjusting view
parameters. The figure shows a portion of the Decision Tree view with its associated mining
legend. It shows the most closely correlated information at the first level, in this case, number
of cars owned. The depth of color of each node is a visual cue to the amount of association—
darker colors indicate more association. Note that the mining legend reflects the exact
number of cases (or rows) for the particular node of the model that is selected. It also shows
the information via a probability column (percentage) and a histogram (graphical represen-
tation). In the diagram, the selected node is Number Of Cars Owned = 2. We’ve also set the
Background filter to 1, indicating that we wish to see data for those who actually purchased a
bicycle, rather than for all cases. Note also that the default setting for levels is 3. This particu-
lar model contains six levels; however, viewing them all on this surface is difficult.

Note If you set the levels or default expansion settings to the maximum included in the
model (six), you can observe one of the challenges of implementing data mining as part of a BI
solution—effective visualization. We’ll talk more about this topic as we continue on through the
world of SSAS data mining; suffice it to say at this point that we’ve found the ability to provide
appropriate visualizations of results to be a key driver of success in data mining projects. The
included viewers are a start in the right direction; however, we’ve found that in the real world it is
rare for our clients or other developers to spend enough time thoroughly understanding what is
included before they try to buy or build other visualizers.

Figure 12-7 The Microsoft Tree Viewer for the Microsoft Decision Trees algorithm

The other type of built-in view for the Microsoft Tree Viewer in BIDS for this algorithm is the
Dependency Network. This view allows you to quickly see which data has the strongest cor-
relation to a particular node. You can adjust the strength of association being shown by drag-
ging the slider on the left of the diagram up or down. Figure 12-8 shows the Dependency
Network for the same mining structure that we’ve been working with. You’ll note that the
three most correlated factors for bike purchasing are yearly income, number of cars owned,
and region.

Tip We’ve found that the Dependency Network view is one of the most effective and univer-
sally understood. We use it early and often in our data mining projects. Remember that all of the
viewers found in BIDS are available in SSMS and, most important, as part of Excel 2007 after you
install the SQL Server 2008 Data Mining Add-ins for Office 2007. Depending on the sophistica-
tion and familiarity of the client (business decision maker, developer, or analyst), we’ve sometimes
kept our initial discussion of viewers and algorithms to Excel rather than BIDS. We do this to
reduce the complexity of what we are demonstrating. We find this approach works particularly
well in proof-of-concept discussions.

Figure 12-8 The Dependency Network view for the Microsoft Decision Trees algorithm

As with the Decision Tree view for the Microsoft Tree Viewer, the Dependency Network
view includes some configurable settings. Of course, the slider on the left is the most pow-
erful. Note that this view, like most others, also contains an embedded toolbar that allows
you to zoom/pan and to otherwise tinker with the viewable information. At the bottom of
this viewer you’ll also note the color-coded legend, which assists with understanding of the
results.

In contrast to the simplicity of the Dependency Network view, the Generic Content Tree
Viewer shown in Figure 12-9 presents a large amount of detail. It shows the processed results
in rows and columns of data. For certain mining models, this viewer will include nested tables
in the results as well. This viewer includes numeric data for probability and variance rates.
We’ve found that this level of information is best consumed by end users who have a formal
background in statistics. In addition to the information presented by this default viewer, you
can also query the models themselves using the DMX and XMLA languages to retrieve what-
ever level of detail you desire for your particular solution.

Figure 12-9 The Microsoft Generic Content Tree Viewer for the Microsoft Decision Trees algorithm is
complex and detailed.

To continue our journey through the available viewers, select TM Clustering from the Mining
Model drop-down list on the Mining Model Viewer tab. You can see that you have new types
of viewers to select from. The Microsoft Cluster Viewer includes the following four differ-
ent views, which are reflected in nested tabs in the viewer: Cluster Diagram, Cluster Profiles,
Cluster Characteristics, and Cluster Discrimination. We suggest that you now take the time to
look at each sample data mining structure that is included in the AdventureWorksDW 2008
sample, its associated models, and their associated viewers. Looking at these visualizers will
prepare you for our discussion of the guts of the algorithms, which starts shortly.

One capability available in the viewers might not be obvious. Some of the included algo-
rithms allow the viewing of drillthrough data columns. An improvement in SQL Server 2008
is that for some algorithms drillthrough is available to all source columns in the data mining
structure, rather than just those included in the individual model. This allows you to build
more compact models, which is important for model query and processing performance. If
you choose to use drillthrough, you must enable it when you create the model.

The following algorithms do not support drillthrough: Naïve Bayes, Neural Network, and
Logistic Regression. The Time Series algorithm supports drillthrough only via a DMX query,
not in BIDS. Figure 12-10 shows the results of right-clicking, clicking Drill Through, and then
clicking Model And Structure Columns on the Cluster 10 object in the Cluster Diagram view
of the TM Clustering model. Of course, if you choose to use drillthrough in your BI solution,
you must verify that any selected end-user tools also support this capability.

Figure 12-10 Drillthrough results window from the Mining Model Viewer in BIDS for the TM Clustering
sample
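
For models that allow it, the same drillthrough data can be pulled with a DMX query instead of the viewer—and for the Time Series algorithm, as noted above, DMX is the only route. The query below is a sketch against the TM Clustering model; the structure column named here is an assumption:

// Return the training cases behind the model, plus one column that lives
// only in the mining structure (not in this model)
SELECT *, StructureColumn('Yearly Income') AS [Income From Structure]
FROM [TM Clustering].CASES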

Before we begin our investigation of the data mining algorithms, let’s briefly look at the
other two tabs that are part of the BIDS mining structure designer. At this point we just want
to get a high-level overview of what type of activity is performed here. Because a great deal
of power (and complexity) is associated with these included capabilities, we take a deeper
look at the last two tabs (Mining Accuracy Chart and Mining Model Prediction) in the next
chapter.

Mining Accuracy Charts and Prediction


The next tab in the BIDS mining structure designer is the Mining Accuracy Chart tab. Here
you can validate (or compare) your model against some actual data to understand how
accurate your model is and how well it will work to predict future target values. This tool
is actually quite complex, containing four nested tabs named Input Selection, Lift Chart,
Classification Matrix, and Cross Validation.

You might be surprised that we are looking at these sections of the mining tools in BIDS
before we review the detailed process for creating mining structures and models. We have
found in numerous presentations that our audiences tend to understand the whys of model
structure creation details more thoroughly if we first present the information in this sec-
tion. So bear with us as we continue our journey toward understanding how to best design,
develop, validate, and use mining models.

Note The interface for this tab has changed in a couple of ways in SQL Server 2008. One way
reflects a change to the capabilities of model building. That is, now the Model Creation Wizard
includes a page that allows you to automatically create a training (sub) set of your source data.
We talk more about this in Chapter 13 when we go through that wizard.

Figure 12-11 shows the Input Selection nested tab of the Mining Accuracy Chart tab. Note
that you can configure a number of options as you prepare to validate the mining models in
a particular mining structure. The options are as follows:

■■ Synchronize Prediction Columns And Values (selected by default)


■■ Select one or more included Mining Models (all are selected by default)
■■ Configure the Predictable Column Name and value for each model. In this case all
models have only one predictable column. We’ve set the value to 1 (bike buyers).
■■ Select the testing set. Either use the set automatically created when the model was
created, or manually specify a source table or view. If you manually specify, then you
should validate the automatically generated column mappings by clicking the Build (…)
button.
■■ (Optional) Specify a filter for the manually associated testing set by creating a filter
expression.

Figure 12-11 The Mining Accuracy Chart tab allows you to validate the usefulness of one or more
mining models.

The results of the Mining Accuracy Chart tab are expressed in multiple ways, including a lift
chart, a profit chart, a classification matrix, or a cross validation. Generally they allow you to
assess the value of the results your particular model predicts. These results can be complex to
interpret, so we’ll return to this topic in the next chapter (after we’ve learned the mechanics
of building and processing data mining structures). Also note that the cross validation capa-
bility is a new feature introduced in SQL Server 2008.

The next tab in the BIDS designer is the Mining Model Prediction tab. Here you can create
predictions based on associating mining models with new external data. When you work
with this interface, what you are actually doing is visually writing a particular type of DMX—
specifically a prediction query. The DMX language contains several prediction query types and
keywords to implement these query types. We look at DMX in more detail in the next chapter.

When you first open this interface, the first data mining model in the particular structure will
be populated in the Mining Model window. You can select an alternative model from the
particular structure by clicking the Select Model button at the bottom of the Mining Model
window. The next step is to specify the source of the new data. You do this by clicking the
Select Case Table button in the Select Input Table(s) window. After you select a table, the
designer will automatically match columns from source and destination with the same names.
To validate the automatically generated column mappings, you simply right-click in the
designer and then click Modify Connections. This opens the Modify Mapping window.

After you’ve selected the source table(s) and validated the column mappings, you’ll use the
lower section of the designer to create the DMX query visually. By using the first button on
the embedded toolbar you can see the DMX that the visual query designer has generated, or
you can execute the query. Remember that you can write and execute DMX queries in SSMS
as well. Figure 12-12 shows the interface in BIDS for creating a prediction query.

Figure 12-12 The Mining Model Prediction tab allows you to generate DMX prediction queries.
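
If you click the toolbar button to view the generated DMX, the query will look roughly like the following sketch. The model name matches the sample, but the data source, input table, and column names are assumptions—substitute whatever new-data table you mapped in the Select Input Table(s) window:

SELECT
    t.[ProspectiveBuyerKey],
    Predict([Bike Buyer])            AS [Predicted Bike Buyer],
    PredictProbability([Bike Buyer]) AS [Probability]
FROM [TM Decision Tree]
PREDICTION JOIN
    OPENQUERY([Adventure Works DW],
        'SELECT ProspectiveBuyerKey, Gender, NumberCarsOwned, YearlyIncome
         FROM dbo.ProspectiveBuyer') AS t
ON  [TM Decision Tree].[Gender] = t.[Gender]
AND [TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned]
AND [TM Decision Tree].[Yearly Income] = t.[YearlyIncome]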

We’ve now covered our initial tour through the BIDS interface for data mining. We certainly
haven’t seen everything yet. As we move to the next level, it’s time to dive deeper into the
particular capabilities of the included algorithms. Understanding what each does is, of course,
key to implementing SSAS data mining successfully in your BI project.

Data Mining Algorithms


Now we’ll begin to work through the capabilities of all included algorithms. We’ll take a look
at them in order of complexity, from the simplest to the most complex. For each algorithm,
we discuss capabilities, common configuration, and some of the advanced configurable prop-
erties. Before we start reviewing the individual algorithms, we cover a concept that will help
you understand how to select the mining algorithm that best matches your business needs.
This idea is called supervision—algorithms are either supervised or unsupervised. Supervised
mining models require you to select both input and predictable columns. Unsupervised
mining models require you to select only input columns. When you are building your mod-
els, SSAS presents you with an error dialog box if you do not configure your model per the
supervision requirements. The unsupervised algorithms are Clustering and Sequence Clustering.
The supervised algorithms are Association, Decision Trees, Linear Regression, Logistic Regression,
Naïve Bayes, Neural Network, and Time Series.

Microsoft Naïve Bayes


Microsoft Naïve Bayes is one of the most straightforward algorithms available to you in SSAS.
It is often used as a starting point to understanding basic groupings in your data. This type of
processing is generally characterized as classification. The algorithm is called naïve because
no one attribute has any higher significance than another. It is named after Thomas Bayes,
who envisioned a way of applying mathematical (probability) principles to understanding
data. Another way to understand this is that all attributes are treated as independent, or not
related to one another. Literally, the algorithm simply counts correlations between all attri-
butes. Although it can be used for both predicting and grouping, Naïve Bayes is most often
used during the early phases of model building. It’s more commonly used to group rather
than to predict a specific value. Typically you’ll mark all attributes as either simple input or
both input and predictable, because this asks the algorithm to consider all attributes in its
execution. You may find yourself experimenting a bit when marking attributes. It is quite
typical to include a large number of attributes as input, then to process the model and evalu-
ate the results. If the results don’t seem meaningful, we often reduce the number of included
attributes to help us to better understand the most closely correlated relationships.

You might use Naïve Bayes when you are working with a large amount of data about which
you know little. For example, your company may have acquired sales data after purchasing a
competitor. We use Naïve Bayes as a starting point when we work with this type of data.

You should understand that this algorithm contains a significant restriction: Only discrete
(or discretized) content types can be evaluated. If you select a data structure that includes
data columns marked with content types other than Discrete (such as Continuous), those
columns will be ignored in mining models that you create based on the Naïve Bayes algo-
rithm. This algorithm includes only a small number of configurable properties. To view
the parameters, we’ll use the Targeted Mailing sample. Open it in BIDS, and then click the
Mining Models tab. Right-click the model that uses Naïve Bayes and then click Set Algorithm
Parameters. You’ll see the Algorithm Parameters dialog box, shown in Figure 12-13.

Figure 12-13 The Algorithm Parameters dialog box allows you to view and possibly change parameter
values.

Four configurable parameters are available for the Naïve Bayes algorithm: MAXIMUM_
INPUT_ATTRIBUTES, MAXIMUM_OUTPUT_ATTRIBUTES, MAXIMUM_STATES, and MINIMUM_
DEPENDENCY_PROBABILITY. You can change the configured (default) values by typing the
new value in the Value column. As mentioned previously, configurability of parameters
is dependent on the edition of SQL Server you are using. This information is noted in the
Description section of the Algorithm Parameters dialog box.

You might be wondering how often you’ll be making adjustments to the default values for
the algorithm parameters. We find that as we become familiar with the capabilities of par-
ticular algorithms, we tend to begin using manual tuning. Because Naïve Bayes is frequently
used in data mining projects, particularly early in the project, we do find ourselves tinkering
with its associated parameters. The first three are fairly obvious: Adjust the configured value
to reduce the maximum number of input values, output values, or possible grouping states.
The last parameter is less obvious. When you reduce that value, you are asking for a reduc-
tion in the number of nodes or groups that the processed model produces.
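
If you do override a default, the same parameters can be supplied in DMX when the model is added to the structure, which also serves as documentation of the change. The parameter values below are purely illustrative assumptions:

ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [TM Naive Bayes Tuned]
(
    [Customer Key],
    [Gender],
    [Number Cars Owned],
    [Bike Buyer] PREDICT
)
USING Microsoft_Naive_Bayes (MAXIMUM_INPUT_ATTRIBUTES = 10, MINIMUM_DEPENDENCY_PROBABILITY = 0.75)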

Feature Selection
SSAS data mining includes a capability called feature selection. This setting is applied before
the model is trained (loaded with source data). Feature selection automatically chooses the
attributes in a dataset that are most likely to be used in the model. If feature selection is used
during mining model processing, you will see that it was used in the detailed execution statements
that are listed in the mining model processing window.

Feature selection works on both input and predictable attributes. It can also work on the
number of states in a column, depending on the algorithm being used in the mining model.
Only the input attributes and states that the algorithm selects are included in the model-
building process and can be used for prediction. Predictable columns that are ignored by
feature selection are used for prediction, but the predictions are based only on the global
statistics that exist in the model. To implement feature selection, SSAS uses various methods
(documented in the SQL Server Books Online topic “Feature Selection in Data Mining”) to
determine what is called the “interestingness score” of attributes. These methods depend on
the algorithm used.

It is important to understand that you can affect the invocation and execution of these vari-
ous methods of determining which attributes are most interesting to the model. You do this
by changing the configured values for the following mining model parameters: MAXIMUM_
INPUT_ATTRIBUTES, MAXIMUM_OUTPUT_ATTRIBUTES, and MAXIMUM_STATES.

Tip The Algorithm Parameters dialog box shows only a partial list of the configurable param-
eters for each algorithm. For several algorithms, it shows none at all. If you wish to add a con-
figurable parameter and a value, click Add at the bottom of the dialog box. If you search on the
name of the particular algorithm in SQL Server Books Online, you can review the complete list
of configurable parameters for each algorithm. If you have experience with SQL Server 2005
data mining, you’ll notice that for many algorithms using SQL Server 2008 (BIDS), the Algorithm
Parameters dialog box will show you more configurable parameters than were shown in the 2005
edition.

In a nutshell, you can think of feature selection as a kind of a built-in improvement algo-
rithm—that is, it uses an internal algorithm to try to improve the quality of your mining
model results. If you are new to data mining, you’ll probably just want to let it run as is. As
you become a more advanced user, you may want to guide or override feature selection’s
execution using the method described in the previous paragraph. We find feature selection
to be helpful because of the many unknowns you can encounter when you work with data
mining models: quality of data, uncertainty about which data to include in a model, choice
of algorithm, and so on. Feature selection attempts to intelligently narrow the results of data
mining model processing to create a more targeted and more meaningful result. We particu-
larly find this to be useful during the early stages of the data mining life cycle—for example,
when we are asked to mine new data, perhaps purchased from a competitor. We often use
the Naïve Bayes algorithm in such situations, and we particularly find feature selection useful
in combination with less precise algorithms such as Naïve Bayes.

The Microsoft Naïve Bayes Viewer includes four types of output views: Dependency Network,
Attribute Profiles, Attribute Characteristics, and Attribute Discrimination. We often use the
Dependency Network view because its output is easy to understand. It simply shows the
related attributes and the strength of their relationship to the selected node. (You adjust the
view by using the slider at the left of the view.) This view (also included with the Microsoft
Tree Viewer) was shown in Figure 12-8. The Attribute Profiles view, a portion of which is
shown in Figure 12-14, shows you how each input attribute relates to each output attribute.
You can rearrange the order of the attributes shown in this view by clicking and dragging
column headers in the viewer.

Figure 12-14 The Attribute Profiles view for Naïve Bayes provides a detailed, attribute-by-attribute look at
the algorithm results.

You can change the view by adding or removing the States legend, by changing the number
of histogram bars, or by changing the viewed predictable attribute (though only if you’ve
built a model that contains more than one predictable value, of course). You can also hide
columns by right-clicking the desired column and then clicking Hide. Note that the options
below the Drillthrough option on the shortcut menu are not available. This is for two reasons:
First, drillthrough is not enabled by default for any mining model. Second, the algorithm on
which this particular sample model is built, Naïve Bayes, does not support drillthrough.

On the next tab, the Attribute Characteristics view, you can see the probability of all attri-
butes as related to the predicted value output. In this example, the default value is set to a
state of 0, which means “does not buy a bicycle.” The default sort is by strongest to weak-
est correlation for all attributes. If you wish to adjust this view—for example, to sort within
attribute values—simply click the column header to re-sort the results. Figure 12-15 shows a
portion of the Attribute Characteristics view set to sort by Attributes. By sorting in this view,
you can easily see that a short commute distance correlates most strongly (of all attributes in
view) to the state of not purchasing a bicycle.

Figure 12-15 The Attribute Characteristics view for Naïve Bayes allows you to see attribute correlation in
several different sort views.

On the last tab, the Attribute Discrimination view, you can compare the correlations between
attributes that have two different states. Continuing our example, we see in Figure 12-16 that
the attribute value of owning 0 cars correlates much more significantly to the predicted state
value of buying a bicycle (Value 2 drop-down list set to 1) than to the state of not buying a
bicycle. You can further see that the next most correlated factor is the Age attribute, with a
value of 35-40. As with the Attribute Characteristics view, you can re-sort the results of the
Attribute Discrimination view by clicking any of the column headers in the view.

It is important to make sound decisions based on the strength of the correlations. To help you
do that in this view, you can right-click any of the data bars and then click Show
Legend. This opens a new window that shows you the exact count of cases that support the
view produced. For example, opening the legend for the attribute value Number Of Cars
Owned shows the exact case (row) count to support all of the various attribute states: cars
owned = 0 or !=1, and bicycles purchased=0 or 1. These results are shown in a grid.

As previously mentioned, Naïve Bayes is a simple algorithm that we often use to get started
with data mining. The included views are easy to understand and we often show such results
directly to customers early in the data mining project life cycle so that they can better
understand their data and the possibilities of data mining in general. We turn next to a very
popular algorithm, Microsoft Decision Trees.

Figure 12-16 The Attribute Discrimination view for Naïve Bayes allows you to compare two states and their
associated attribute values.

Microsoft Decision Trees Algorithm


Microsoft Decision Trees is probably the most commonly used algorithm, in part because of
its flexibility—decision trees work with both discrete and continuous attributes—and also
because of the richness of its included viewers. It’s quite easy to understand the output via
these viewers. This algorithm is used both to explore data and to predict. It is also used (usually in
conjunction with the Microsoft Clustering algorithm) to find deviant values. The Microsoft
Decision Trees algorithm processes input data by splitting it into recursive (related) subsets.
In the default viewer, the output is shown as a recursive tree structure.

If you are using discrete data, the algorithm identifies the particular inputs that are most
closely correlated with particular predictable values, producing a result that shows which col-
umns are most strongly predictive of a selected attribute. If you are using continuous data,
the algorithm uses standard linear regression to determine where the splits in the decision
tree occur.

Figure 12-17 shows the Decision Tree view. Note that each node has a label to indicate the
value. Clicking a node displays detailed information in the Mining Legend window. You can
configure the view using the various drop-down lists at the top of the viewer, such as Tree,
Default Expansion, and so on. Finally, if you’ve enabled drillthrough on your model, you can
display the drillthrough information—either columns from the model or (new to SQL Server
2008) columns from the mining structure, whether or not they are included in this model.
The drillthrough result for Number Cars Owned = 0 is shown in the Drill Through window in
Figure 12-17.

Figure 12-17 By adjusting the COMPLEXITY_PENALTY value, you can prune your decision tree, making the
tree easier to work with.

Microsoft Decision Trees is one of the algorithms that we have used most frequently when
implementing real-world data mining projects. Specifically, we’ve used it in business market-
ing scenarios to determine which attributes are more closely grouped to which results. We
have also used it in several law-enforcement scenarios to determine which traits or attributes
are most closely associated to offender behaviors.

When you are working with the Microsoft Decision Trees algorithm, the value of the results
can be improved if you pre-group data. You can do this using ETL processes or by running stan-
dard queries against the source data before you create a data mining structure.
Another important consideration when using this algorithm is to avoid overtraining (overfitting)
your model. You can control this by adjusting the value of the COMPLEXITY_PENALTY parameter
in the Algorithm Parameters dialog box. Raising this number penalizes additional splits, usually
reducing the number of inputs to be considered and thereby reducing the size of your decision
tree. The default value depends on the number of input attributes: 0.5 for 1 through 9 attributes,
0.9 for 10 through 99 attributes, and 0.99 for 100 or more.
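
As a hedged illustration (the names below are hypothetical), the same parameter can be supplied
in DMX when you add a decision tree model to an existing structure:

ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [TM Decision Tree Pruned]
(
    [Customer Key],
    [Age],
    [Commute Distance],
    [Number Cars Owned],
    [Bike Buyer] PREDICT
)
USING Microsoft_Decision_Trees (COMPLEXITY_PENALTY = 0.95)
WITH DRILLTHROUGH

Raising COMPLEXITY_PENALTY toward 1.0 trades some detail for a smaller, more readable tree.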

Another capability of the Microsoft Decision Trees algorithm is to create multiple result trees.
If you set more than one source column as predictable (or if the input data contains a nested
table that is set to predictable), the algorithm builds a separate decision tree for each pre-
dictable source column. You can select which tree you’d like to view in the Decision Tree view
by selecting the tree you wish to view from the Tree drop-down list.

The Dependency Network view is also available for the Microsoft Decision Trees algorithm. It
looks and functions similarly to the way it does with the Naïve Bayes algorithm. As with Naïve
Bayes, you can remove nodes that are less related to the selected node by adjusting the value
of the slider to the left of the view. This view was shown in Figure 12-8.

Tip If the result of using this algorithm hides some nodes you wish to include in your mining
model output, consider creating another mining model using the same source data, but with a
more flexible algorithm, such as Naïve Bayes. You’ll be able to see all nodes in your output.

Microsoft Linear Regression Algorithm


Microsoft Linear Regression is a variation of the Microsoft Decision Trees algorithm, and
works like classic linear regression—it fits the best possible straight line through a series of
points (the sources being at least two columns of continuous data). This algorithm calculates
all possible relationships between the attribute values and produces more complete results
than other (non–data mining) methods of applying linear regression. In addition to a key col-
umn, you can use only columns of the continuous numeric data type.

Another way to understand this algorithm is that it is a decision tree with splitting disabled. You use this algorithm to be able to
visualize the relationship between two continuous attributes. For example, in a retail scenario,
you might want to create a trend line between physical placement locations in a retail store
and rate of sale for items. The algorithm result is similar to that produced by any other linear
regression method in that it produces a trend line. Unlike most other methods of calculating
linear regression, the Microsoft Linear Regression algorithm in SSAS calculates all possible
relationships between all input dataset values to produce its results. This differs from other
methods of calculating linear regression, which generally use progressive splitting techniques
between the source inputs.

The configurable parameters are MAXIMUM_INPUT_ATTRIBUTES, MAXIMUM_OUTPUT_ATTRIBUTES,
and FORCE_REGRESSOR. This algorithm is used to predict continuous attributes. When using this
algorithm, you mark one or more attributes as regressors. A regressor attribute must be marked
with the Continuous content type; it is used as an independent variable in the regression formula.
You can force a source column to be treated as a regressor by using the FORCE_REGRESSOR
parameter. Alternatively, you can set the DMX REGRESSOR flag on your selected column.
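
A minimal sketch of the DMX column syntax follows. The column and model names are hypothetical,
and the retail scenario is only illustrative; the REGRESSOR flag identifies the input used in the
regression formula:

CREATE MINING MODEL [Shelf Position Sales]
(
    [Product Key] LONG KEY,
    [Shelf Height Inches] DOUBLE CONTINUOUS REGRESSOR,
    [Weekly Sales Amount] DOUBLE CONTINUOUS PREDICT
)
USING Microsoft_Linear_Regression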

Microsoft Time Series Algorithm


Microsoft Time Series is used to address a common business problem: accurate forecasting.
This algorithm is often used to predict future values, such as rates of sale for a particular
product. Most often the inputs are continuous values. To use this algorithm, your source data
must contain at least one column marked as Key Time. Any predictable columns must be of type
Continuous. You can select one or more inputs as predictable columns when using this algo-
rithm. Time series source data can also contain an optional Key Sequence column.

New in SQL Server 2008 for the Microsoft Time Series algorithm is an additional algorithm
inside of the time series—the Auto Regressive Integrated Moving Average (ARIMA) algorithm.
It is used for long-term prediction. In SQL Server 2005 the Microsoft Time Series algorithm
used only Auto Regression Trees with Cross Predict (ARTxp), which is more effective at pre-
dicting the next step in a series for up to a maximum of 5-10 future steps. However, past that
point, ARIMA performs better.

Also new for 2008 is the ability to configure custom blending of the two types of time series
algorithms (this requires using the Enterprise edition). You do this by configuring a custom
value for the PREDICTION_SMOOTHING parameter. In the Standard edition of SQL Server
2008 both types of time algorithms are used and the results are automatically blended by
default. In the Standard edition you can choose to use one of the two included algorithms,
rather than the default blended result. However, you cannot tune the blending variables (as
you can with the Enterprise edition). Note that the FORECAST_METHOD parameter shows
you which algorithms are being used.

When you are working with the Microsoft Time Series algorithm another important consider-
ation is the appropriate detection of seasonal patterns. You should understand the following
configurable parameters when considering seasonality:

■■ AUTO_DETECT_PERIODICITY Lowering the default value of 0.6 (the range is 0 to 1.0)
results in reducing the model training (processing) time because periodicity is detected
only for strongly periodic data.
■■ PERIODICITY_HINT Here you can provide multiple values to give the algorithm a hint
about the natural periodicity of the data. You can see in Figure 12-18 that the value 12
has been supplied. For example, if monthly data has both quarterly and yearly periodicity,
the setting should be {3, 12}.

Figure 12-18 The FORECAST_METHOD parameter allows you to configure the type of time algorithm
(requires Enterprise edition).
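
The following DMX sketch pulls these settings together. The model and column names are our own
(they may not match the sample Forecasting structure discussed in the next paragraph exactly), and
the string format shown for PERIODICITY_HINT is an assumption you should verify against SQL Server
Books Online:

ALTER MINING STRUCTURE [Forecasting]
ADD MINING MODEL [Forecasting Blended]
(
    [Time Index],
    [Model Region],
    [Amount] PREDICT,
    [Quantity] PREDICT
)
USING Microsoft_Time_Series
    (FORECAST_METHOD = 'MIXED',
     PREDICTION_SMOOTHING = 0.8,
     PERIODICITY_HINT = '{12}')

// Forecast the next six periods for each series in the model
SELECT FLATTENED [Model Region], PredictTimeSeries([Amount], 6) AS Forecast
FROM [Forecasting Blended]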

The Microsoft Time Series Viewer helps you to understand the output of your model. It
shows the predicted values over the configured time series. You can also configure the num-
ber of prediction steps and whether you’d like to show deviations, as shown in Figure 12-19.
The default view using the sample Forecasting data mining structure (and Forecasting mining
model) shows only a subset of products in the Charts view. If you want to view the forecast
values for other bicycle models, simply select those models from the drop-down list on the
right side of the view. Note also that this sample model includes two predictable columns,
Amount and Quantity. So each output value (bicycle) has two predictive values.

Figure 12-19 The Charts view lets you see the selected, predicted values over the time series.

In our view we’ve selected the two predictive output values Amount and Quantity for a single
product (M200 Europe). We've also selected Show Deviations to add those values to our chart
view. (Deviations appear as error bars that indicate how much the actual values are expected
to vary from each prediction.) When you pause your mouse on any point
on the output lines on the chart, a tooltip displays more detailed information about that
value.

The other view included for mining models built using the Microsoft Time Series algorithm
is the Model view. It looks a bit like the Decision Tree view used to view mining models
built with the Microsoft Decision Trees algorithm. However, although time series mod-
els in the Model view (Figure 12-20), have nodes (as shown in the Decision Tree view), the
Mining Legend window shows information related to this algorithm—namely coefficients,
histograms, and tree node equations. You have access to this advanced information so that
you can better understand the method by which the predictions have been made.

Figure 12-20 The Model view allows you to see the coefficient, histogram, and equation values for each node
in the result set.

Closer examination of the values in the Mining Legend window shows that in addition to the
actual equations, a label lists which of the two possible time-based algorithms (ARIMA or
ARTxp) was used to perform this particular calculation.

Microsoft Clustering Algorithm


As its name indicates, the Microsoft Clustering algorithm focuses on showing you meaningful
groupings in your source data. Unlike Naïve Bayes, which requires discrete content inputs and
considers all input attributes of equal weight, Microsoft Clustering allows for more flexibility
in input types and grouping methodologies. You can use more content types as input and you
can configure the method used to create the groups or clusters. We’ll dive into more detail in
the next section.

Microsoft Clustering separates your data into intelligent groupings. As we mentioned in
the previous paragraph, you can use Continuous, Discrete, and most other content types.
You can optionally supply a predictive value, by marking it as predict only. Be aware that
Microsoft Clustering is generally not used for prediction—you use it to find natural group-
ings in your data.

When using the Microsoft Clustering algorithm it is important for you to understand the
types of clustering that are available to you, which are called hard or soft. You can configure
the CLUSTERING_METHOD parameter using the properties available for this algorithm, as
shown in Figure 12-21. The choices are Scalable EM (Expectation Maximization), Non-Scalable
EM, Scalable K-Means, or Non-Scalable K-Means. The default is Scalable EM, mostly for its
lightweight performance overhead. K-type clustering is considered hard clustering because it
creates buckets (groupings) and then assigns your data into only one bucket with no overlap.
EM clustering takes the opposite approach—overlaps are allowed. This type is sometimes
called soft clustering. The scalable portion of the selected method of clustering refers to
whether a subset of the source data or the entire set of source data is used to process the
algorithm. For example, a maximum size of 50,000 source rows is used to initially process
Scalable EM. If that value is sufficient for the algorithm to produce meaningful results, other
data rows greater than 50,000 are ignored. In contrast, Non-Scalable EM loads the entire
dataset at initial process. Performance can be up to three times faster for Scalable EM than
for Non-Scalable EM. Scalability works identically for K-Means, meaning that the Scalable
version of K-Means loads the first 50,000 rows and loads subsequent rows only if meaningful
results are not produced by the first run of the algorithm.

You can use the CLUSTERING_METHOD parameter to adjust the clustering method used. In
Figure 12-21, notice that the default value is 1. The possible configuration values are 1–4: 1,
Scalable EM; 2, Non-Scalable EM; 3, Scalable K-Means; 4, Non-Scalable K-Means.

Figure 12-21 You can choose from four different methods of implementing the clustering algorithm.
Configure your choice by using the CLUSTERING_METHOD parameter.
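
As a sketch (structure, model, and column names are hypothetical), you could add a hard-clustering
(scalable K-Means) model and then ask which cluster a single new case falls into:

ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [TM Clusters KMeans]
(
    [Customer Key],
    [Age],
    [Commute Distance],
    [Number Cars Owned],
    [Bike Buyer] PREDICT_ONLY
)
USING Microsoft_Clustering (CLUSTERING_METHOD = 3, CLUSTER_COUNT = 8)

// Assign a single, illustrative case to a cluster
SELECT Cluster() AS Segment, ClusterProbability() AS Probability
FROM [TM Clusters KMeans]
NATURAL PREDICTION JOIN
(SELECT 35 AS [Age], '0-1 Miles' AS [Commute Distance], 0 AS [Number Cars Owned]) AS t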

After you’ve created and tuned your model, you can use the Microsoft Cluster Viewer to
better understand the clusters that have been created. The four types of Microsoft Cluster
views are: Cluster Diagram, Cluster Profiles, Cluster Characteristics, and Cluster Discrimination.
Looking first at the Cluster Diagram view, you’ll notice that the default shading variable is set
to the entire population. Because of this, the State value is not available. Usually we adjust
the Shading Variable value to something other than the default, such as Bike Buyer, which is
shown in Figure 12-22. We’ve also set the State value to 1. This way we can use the results of
the Microsoft Clustering algorithm to understand the characteristics associated with bicycle
purchasers. We also show the result of Drill Through on Cluster 1. This time we limited our
results to model columns only.

Figure 12-22 The Cluster Diagram view shows information about variables and clusters, including Drill
Through.

As you take a closer look at this view, you’ll see that you can get more information about
each cluster’s characteristics by hovering over it. You can rename clusters by right-clicking
them. It is common to rename cluster nodes to help with the usability of the view. For
example, you might use something like Favors Water Bottles as a cluster node name. You can
view drillthrough information for a cluster by right-clicking it and then clicking Drill Through.
Figure 12-22 shows the Cluster Diagram view with the Drill Through results for Cluster 1.
Notice the (now familiar) dependency strength slider to the left of this view.

The other three views—Cluster Profiles, Cluster Characteristics, and Cluster Discrimination—
are quite similar to those with the same names that are available to view the results of the
Naïve Bayes algorithm (Attribute Profiles, for example). Do you understand the key differ-
ences between Naïve Bayes and Microsoft Clustering? To reiterate: For Microsoft Clustering
you have more flexibility in source data types and in configuring the method of grouping
(EM or K-Means and Scalable or Non-Scalable). For these reasons Microsoft Clustering is used
for similar types of tasks, such as finding groupings in source data, but it is more often used
later in data mining project life cycles than Naïve Bayes.

Tip Microsoft Clustering is sometimes used to prepare or investigate source data before imple-
menting Microsoft Decision Trees. The results of clustering (clusters) are also sometimes used
as input to separate Microsoft Decision Trees models. This is because this type of pre-separated
input reduces the size of the resulting decision tree, making it more readable and useful.

Microsoft Sequence Clustering


Microsoft Sequence Clustering produces output similar to that of (regular) Microsoft Clustering,
with one important addition: it analyzes the transitions between states. In other words, it detects
clusters, but clusters of a particular type—clusters of sequenced data. To implement this algo-
rithm you must mark at least one source column with the Key Sequence content type. This
key sequence must also be in a nested table. If your source data structure does not include
appropriately typed source data, this algorithm type will not be available in the drop-down
list when you create or add new mining models. One example is click-stream analysis of navi-
gation through various Web pages on a Web site. (Click-stream analysis refers to which pages
were accessed in what order by visitors to one or more Web sites.)

The Microsoft Sequence Clustering algorithm uses the EM (Expectation Maximization) method
of clustering. Rather than just counting to find associations, this algorithm determines the
distance between all possible sequences in the source data. It uses this information to create
cluster results that show sequences ranked.

An interesting configurable parameter for this algorithm is CLUSTER_COUNT, which allows you
to set the number of clusters that the algorithm builds. The default is 10. Figure 12-23
shows this property. Another parameter is MAXIMUM_SEQUENCE_STATES. If you are using
the Enterprise edition of SQL Server 2008, you can adjust the default of 64 to any number
between 2 and 65,535 (shown in the Range column in Figure 12-23) to configure the maxi-
mum number of sequence states that will be output by the algorithm.

Figure 12-23 You can adjust the maximum number of sequence states if using the Enterprise edition of SQL
Server 2008.
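
A DMX sketch of such a model might look like the following. The click-stream names are hypothetical,
and the nested TABLE PREDICT syntax shown here should be checked against SQL Server Books Online
for your exact scenario:

CREATE MINING MODEL [Site Navigation Clusters]
(
    [Session Id] TEXT KEY,
    [Page Visits] TABLE PREDICT
    (
        [Visit Order] LONG KEY SEQUENCE,
        [Page Name] TEXT DISCRETE
    )
)
USING Microsoft_Sequence_Clustering
    (CLUSTER_COUNT = 8, MAXIMUM_SEQUENCE_STATES = 100)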

Five different views are included as part of the Microsoft Sequence Cluster Viewer for you
to use to see the output of the Microsoft Sequence Clustering algorithm: Cluster Diagram,
Cluster Profiles, Cluster Characteristics, Cluster Discrimination, and State Transitions. The
first four views function quite similarly to those found in the Microsoft Cluster Viewer. Using
the fifth view, State Transitions, you can look at the state transitions for any selected cluster.
Each square (or node) represents a state of the model, such as Water Bottle. Lines represent
the transitions between states, and each line is weighted by the probability of that transition.
The background color represents the frequency of the node in the cluster. As with the other clus-
ter views, the default display is for the entire population. You can adjust this view, as we have
in Figure 12-24, by selecting a particular cluster value (in our case, Cluster 1) from the drop-
down list. The number displayed next to each transition line represents the probability of that
transition.

Figure 12-24 The State Transitions view helps you visualize the transitions between states for each cluster in
your model.

The next algorithm we'll examine is second in popularity only to the Microsoft Decision Trees
algorithm. This is the Microsoft Association algorithm. It is used to analyze groups of items,
called itemsets, which show associative properties. In our real-world experience with imple-
menting data mining solutions, we’ve actually used this algorithm more than any of the other
eight available.

Microsoft Association Algorithm


As we just mentioned, Microsoft Association produces itemsets, or groups of related items
from the source attribute columns. It creates and assigns rules to these itemsets. These rules
rank the probability that the items in particular itemsets will occur together. This is often called
market-basket (or shopping basket) analysis. Source data for Microsoft Association takes the
format of a case and a nested table. Source data can have only one predictable value. Typically
this is the key column of the nested table. All input columns must be of type Discrete.

If you are impressed with the power of this algorithm, you are not alone. As mentioned,
we’ve used this one with almost every customer who has implemented data mining as part
of their BI solution. From brick-and-mortar retailers who want to discover which items when
physically placed together would result in better sales, to online merchants who want to sug-
gest that you might like product x because you bought product y, we use this algorithm very
frequently with customers who want to improve rates of sale of their products. Of course this
isn't the only use for this algorithm, but it is the business scenario we most frequently use it for.

One consideration in using Microsoft Association is that its power doesn’t come cheaply.
Processing source data and discovering itemsets and rules between them is computationally
expensive. Change default parameter values with care when using this algorithm.

Microsoft Association has several configurable parameters. As with some of the other algo-
rithms, the ability to adjust some of these parameters depends on the edition of SQL Server
that you are using. Several values require the Enterprise edition of SQL Server. Included in the
configurable parameters is the ability for you to adjust the maximum size of the discovered
itemsets. This parameter is set to a value of three by default. These parameters are shown in
Figure 12-25.

Figure 12-25 You can adjust the size of the itemsets via the MAXIMUM_ITEMSET_SIZE parameter when
using the Microsoft Association algorithm.
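
The sketch below (structure, model, and column names are hypothetical, patterned on a typical
market-basket layout) shows these parameters in DMX, followed by a singleton query that asks for
three recommendations given one item already in the basket:

ALTER MINING STRUCTURE [Market Basket]
ADD MINING MODEL [MB Association Wide]
(
    [Order Number],
    [Products] PREDICT
    (
        [Model]
    )
)
USING Microsoft_Association_Rules
    (MINIMUM_SUPPORT = 0.01,
     MINIMUM_PROBABILITY = 0.4,
     MAXIMUM_ITEMSET_SIZE = 4)

// Recommend three related products for a basket containing one model
SELECT FLATTENED PredictAssociation([Products], 3)
FROM [MB Association Wide]
NATURAL PREDICTION JOIN
(SELECT (SELECT 'Mountain-200' AS [Model]) AS [Products]) AS t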

Three types of views are available in the Microsoft Association Rules Viewer for you to review
the results of the Microsoft Association algorithm: the Rules, Itemsets, and Dependency
Network views. The Rules view shows a list of the rules that have been generated by the
algorithm. These rules are displayed by default in order of probability of occurrence. You can,
of course, change the sort order by clicking any of the column headers in this view. You can
also adjust the view to show rules with different minimum probabilities or minimum impor-
tance values than the configured defaults display by changing the values of these controls.

The Itemsets view allows you to take a closer look at the itemsets that the algorithm has
discovered. We’ve changed a couple of the default view settings in Figure 12-26, so that you
can more easily understand the data. We’ve selected Show Attribute Name Only in the Show
drop-down list. Next we clicked the Size column to sort by the number of items in each
itemset. You can see that the top single-item itemset is the Sport-100, with 6,171 cases. As
with the Rules view, you can also adjust the minimum support in this view, as well as the
minimum itemset size and the maximum number of rows displayed.

Figure 12-26 The Itemsets view allows you to see the itemsets that were discovered by the Microsoft
Association algorithm.

You can also manually apply a filter to the Rules or the Itemsets view. To do this, type the filter
value in the Filter Itemset box. An example for this particular sample would be Mountain-200
= Existing. After you type a filter, press Enter to refresh the view and see the filter applied.

Although we find value in both the Rules and Itemsets views, we most often use the
Dependency Network view to better understand the results of the Microsoft Association
algorithm. This view shows relationships between items and allows you to adjust the view by
dragging the slider to the left of the view up or down. After you click a particular node, color
coding indicates whether the node is predicted (by another node) or predictive (of another
node). This view is shown in Figure 12-27.

Figure 12-27 The Dependency Network view for the Microsoft Association algorithm allows you to review
itemsets from the results of your mining model.

Tip The SQL Server 2008 Data Mining Add-ins for Excel 2007 include a new capability for
working with the Microsoft Association algorithm using the Table Analysis Tools on the Excel
2007 Ribbon. This algorithm is used by the new Shopping Basket Analysis functionality avail-
able with these tools. The Shopping Basket Analysis button generates two new types of reports
(views) for the output produced: the Shopping Basket Bundled Item and Shopping Basket
Recommendations. We will review this in more detail in Chapter 13.

The next algorithm we look at is the most powerful and complex. We haven’t often used it
in production situations because of these factors. However, we are familiar with enterprise
customers who do need the power of the Microsoft Neural Network algorithm, so we’ll take
a closer look at that next. This algorithm also contains the Microsoft Logistic Regression func-
tionality, so we'll actually be taking a look at both algorithms in the next section.

Microsoft Neural Network Algorithm


Microsoft Neural Network is by far the most powerful and complex algorithm. To glimpse
the complexity, you can simply take a look at the SQL Server Books Online description of the
algorithm: “This algorithm creates classification and regression mining models by constructing
a Multilayer Perceptron network of neurons. Similar to the Microsoft Decision Trees algorithm,
the Microsoft Neural Network algorithm calculates probabilities for each possible state of
the input attribute when given each state of the predictable attribute. You can later use these
probabilities to predict an outcome of the predicted attribute, based on the input attributes.” If
you’re thinking “Huh?,” you’re not alone.

When do you use this algorithm? It is recommended for use when other algorithms fail to
produce meaningful results, such as those measured by a lift chart output. We often use
Microsoft Neural Network as a kind of a last resort, when dealing with large and complex
datasets that fail to produce meaningful results when processed using other algorithms. This
algorithm can accept a data type of Discrete or Continuous as input. For a fuller understand-
ing of how these types are processed, see the SQL Server Books Online topic “Microsoft
Neural Network Algorithm Technical Reference.” Because of the amount of overhead needed to
process these types of models, always test Microsoft Neural Network against large data sources
by using near-production-level loads. As with the other
algorithms, several parameters are configurable using the Algorithm Parameters dialog box
in SQL Server 2008. As with some of the other algorithms that require extensive processing
overhead, you should change the default values only if you have a business reason to do so.
These parameters are shown in Figure 12-28.

Figure 12-28 You can change the HIDDEN_NODE_RATIO default value of 4.0 if you are working with the
Enterprise edition of SQL Server 2008.

The Microsoft Neural Network Viewer contains only one view for this algorithm. The view,
shown in Figure 12-29, does allow you to add input attribute filters as well as adjust the out-
put attribute that is shown. For this figure, we’ve added three input attribute filters related
to Commute Distance, Education, and Occupation. When you pause your mouse on the
bars you’ll see a tooltip with more detailed information about the score, probability, and lift
related to the selected value.

Figure 12-29 Only one viewer is specific to the Microsoft Neural Network algorithm.

A variant of the Microsoft Neural Network algorithm is the Microsoft Logistic Regression
algorithm. We’ll take a look at how it works next.

Microsoft Logistic Regression


Microsoft Logistic Regression is a variant of the Microsoft Neural Network algorithm. (Here
the HIDDEN_NODE_RATIO parameter is set to 0.) Classic logistic regression is a variant of
linear regression that is used when the dependent variable is a dichotomy, such as
success/failure. It is interesting to note that the configurable parameters are identical to
those of the Microsoft Neural Network algorithm, with the exception of the removal of the
HIDDEN_NODE_RATIO value, which will always be set to 0 in this case. Figure 12-30 shows
the configurable parameters for this algorithm.

Figure 12-30 The Microsoft Logistic Regression algorithm contains six configurable parameters.

Because this algorithm is a variant of Microsoft Neural Network, the viewer for it is the same
as the one for that algorithm.

Tip The SQL Server 2008 Data Mining Add-Ins for Excel 2007 include a new capability for work-
ing with the Microsoft Logistic Regression algorithm using the Table Analysis Tools on the Excel
2007 Ribbon. It generates output that can be used offline to calculate predictions using infor-
mation from the processed data mining model. This algorithm is used by the new Prediction
Calculator functionality available with these tools. This activity is described in more detail in
Chapter 25, “SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007.”

The Art of Data Mining


Feeling overwhelmed? As you come to understand the amount of power included out of the
box in SSAS data mining, that is not an uncommon reaction. If you are completely new to
data mining, remember our earlier tip: Rather than starting with BIDS, use the Table Analysis
Tools and Data Mining tab on the Excel 2007 Ribbon to familiarize yourself with the function-
ality of the algorithms. We devote an entire chapter to this very topic (Chapter 23) because
of the importance of using these Ribbon interfaces not only for end users, but also for BI
developers. We’ve often found that the best way to get a feel for exactly what each of these
algorithms does is to use each on a small set of sample data, which can simply originate in
Excel. Selecting the appropriate algorithm is a key part of successfully implementing SSAS
data mining.

One key difference between designing OLAP cubes and data mining structures and models is
that the latter are much more pliable. It is common to tinker when implementing data mining
models by trying out different algorithms, adjusting the dataset used, and tuning the algo-
rithm by adjusting the parameter values. Of course part of this development cycle includes
using the mining model validation visualizers and then continuing to fiddle with the models
based on the results of the validators. This is not to imply a lack of precision in the algo-
rithms; rather, it is a reminder of the original purpose of using data mining—to gain under-
standing about relationships and patterns in data that is unfamiliar to you.

Summary
We’ve completed a deep dive into the world of data mining algorithm capabilities. In this
chapter we discussed preparatory steps for implementing SQL Server 2008 data mining.
These steps included understanding content types and usage attributes for source data, as
well as data mining concepts such as case and nested tables. We then took a closer look at
the BIDS interface for data mining, exploring each of the tabs. Next we discussed how each
of the nine included algorithms works—what they do, what they are used for, and how to
adjust some default values.

Because we’ve covered so much ground already, we are going to break implementation
information into a separate chapter. So, take a break (you’ve earned it!), and after you are
refreshed, join us in the next chapter, where we implement the CRISP-DM SDLC (otherwise
known as a proven software development life cycle for implementing BI projects using data
mining technologies) in BIDS. This includes creating, validating, tuning, and deploying data
mining models. We’ll also examine DMX and take a look at data mining capabilities in SSIS.
Chapter 13
Implementing Data Mining Structures
In this chapter, we implement the Cross Industry Standard Process for Data Mining
(CRISP-DM) life cycle to design, develop, build, and deploy data mining structures and mod-
els using the tools provided in Microsoft SQL Server 2008 Analysis Services (SSAS). We focus
on using the Business Intelligence Development Studio (BIDS) to create mining structures
and models. After we create a model or two, we explore the functionality built into BIDS to
validate and query those models, take a look at the Data Mining Extensions (DMX) language,
and review data mining integration with Microsoft SQL Server 2008 Integration Services
(SSIS). To complete our look at data mining, we discuss concerns related to model mainte-
nance and security.

Implementing the CRISP-DM Life Cycle Model


In Chapter 12, “Understanding Data Mining Structures,” we looked at the data mining algo-
rithms included in SSAS. This chapter focuses on the implementation of these algorithms in
business intelligence (BI) projects by using specific data mining models. We also introduce a
software development life cycle model that we use for data mining projects: the CRISP-DM
life cycle model. As we move beyond understanding its algorithms and data mining capabili-
ties to building and validating actual data mining models, we refer to the phases of the CRISP
model, which are shown in Figure 13-1.

Note Some Microsoft data mining products, notably the Data Mining tab of the Ribbon in
Microsoft Office Excel 2007, have been built specifically to support the phases that make up the
CRISP-DM model. We talk more about the integration of Excel with Analysis Services data mining
in Part IV. In this chapter, the focus is on working with mining models within BIDS. We also cover
integration with Microsoft SQL Server 2008 Reporting Services, Excel, and other client tools in
Part IV.

As previously mentioned, we’ve done the work already (in Chapter 12) to understand and
implement the first three phases of the CRISP model. There we covered business understand-
ing by discussing the particular capabilities of the nine included data mining algorithms as
they relate to business problems. To review briefly, the nine included data mining algorithms
are as follows:

■■ Microsoft Naïve Bayes algorithm A very general algorithm, often used first to make
sense of data groupings

■■ Microsoft Association algorithm Used for market basket analysis—that is, a “what
goes with what” analysis—or for producing itemsets
■■ Microsoft Sequence Clustering algorithm Used for sequence analysis, such as Web
site navigation click-stream analysis
■■ Microsoft Time Series algorithm Used to forecast future values over a certain time
period
■■ Microsoft Neural Network algorithm A very computationally intense algorithm; used
last, when other algorithms fail to produce meaningful results
■■ Microsoft Logistic Regression algorithm A more flexible algorithm than spreadsheet
logistic regression; used as an alternative to that
■■ Microsoft Decision Trees algorithm Used in decision support; the most frequently
used algorithm
■■ Microsoft Linear Regression algorithm A variation of decision trees (See Chapter 12
for details.)
■■ Microsoft Clustering algorithm A general grouping; more specific than Naïve Bayes;
used to find groups of related attributes

[Figure 13-1 diagram: the CRISP-DM cycle of Business Understanding, Data Understanding, Data
Preparation, Modeling, Evaluation, and Deployment, revolving around the data]

Figure 13-1 Phases of the CRISP-DM process model, the standard life cycle model for data mining

Data understanding and data preparation go hand in hand, and they can be performed only
after a particular algorithm has been selected. You’ll recall that this is because algorithms
have particular requirements related to source data, such as use of Naïve Bayes requiring that
all source data be of content type discrete. Data preparation also includes extract, transform,
and load (ETL) processes to clean, validate, and sometimes also pre-aggregate source data.

We’re frequently asked, “Should I create OLAP cubes first, and then implement data mining
using data from those cubes? Or is data mining best implemented using relational source
data?” Although there’s no single right answer to these questions, we generally prefer to
implement data mining after OLAP cubes have been created. The simple and practical rea-
son for this is that, in our experience, most relational source data is in need of cleansing and
validation prior to it being used for any type of BI task. Also OLAP cubes provide aggregated
data. Loading smaller amounts of cleaner data generally produces more meaningful results.

Because we already reviewed the nine mining algorithms in detail in Chapter 12 and you
understand the preparatory steps, you’re ready to create mining models. You can do this
inside of a previously created mining structure, or you can create a new, empty mining struc-
ture in BIDS. Keep in mind that to create a mining structure, at a minimum, you first have
to define a data source and a data source view. You can also choose to build an OLAP cube,
but this last step is not required. After you’ve completed the preparatory steps just listed,
right-click on the Data Mining folder in Solution Explorer in BIDS, and choose New Mining
Structure. Doing this opens the Data Mining Wizard, which allows you to create a data
mining structure with no mining models (if you don’t have an existing structure you want to
work with).

In the following sections, we start by creating an empty mining structure, and add a couple
of mining models to that structure.

Building Data Mining Structures using BIDS


Creating an empty data mining structure might seem like a waste of time. However, we find
this method of working to be efficient. As you’re aware of by now, source data for data min-
ing models requires particular structures (that is, data, content, and usage types), and these
structures vary by algorithm. Creating the structure first helps us to focus on including the
appropriate source data and content types. Also, you might remember that drillthrough can
be implemented against all columns in the structure whether or not they are included in the
model. Creating a structure first and then defining subsets of attributes (columns) included in
the structure results in more compact and effective data mining models.

To create a data mining structure, open BIDS and create a new Analysis Services project.
Create a data source and a data source view. Then right-click on the Data Mining Structures
folder in the Solution Explorer window in BIDS. As you work through the pages in the Data
Mining Wizard to create an empty data mining structure, consider these questions:

■■ What type of source data will you include: relational (from a data source view) or multi-
dimensional (from a cube in your project)? Using this wizard, you can’t combine the
two types of source data; you must choose one or the other. You can, of course, include
more than one table or dimension from the particular selected source. In this chapter,
as in the others, we’ll use sample data from the SQL Server 2008 sample databases or
OLAP cubes.
■■ If you’re creating an empty data structure, indicate that on the second page of the wiz-
ard. Otherwise, you must select one algorithm that will be used to create a data mining
model in your new data structure. As mentioned, at this point we’ll just create an empty
structure, and later we’ll add a couple of mining models to it.
■■ If you choose relational source data, you must decide which table will be the case table.
Will you include a nested table? Which columns will you include? Or, if you choose mul-
tidimensional data, which dimensions, attributes, and facts will you include? During this
part of the wizard, the key column for each source table is detected. You can adjust the
automatically detected value if needed.
■■ What are the content and data types of the included columns or attributes? BIDS
attempts to automatically detect this information as well. When you’re working in
the pages of the wizard, you can make changes to the values that BIDS supplies. You
can also rename source columns or attributes. Figure 13-2 shows this page. You’ll see
that the values available for content types on this page are Continuous, Discrete, or
Discretized for nonkey columns. For key columns, the content type choices are Key, Key
Sequence, or Key Time. As we discussed in Chapter 12, this is a subset of all available
content types. To mark a source column (or attribute) with a content type not available
in the wizard, you change the value in the property sheet for that attribute (after you
complete the wizard).

Figure 13-2 The BIDS Data Mining Wizard detects content and data types in your source data.

The data mining data types—Text, Long, Boolean, Double, and Date—are also auto-
matically detected by the wizard in BIDS.
■■ How much holdout (test) data would you like to use? New for 2008 is the
ability to create holdout test sets during structure or model creation via the wizard. This
capability allows a validation function to determine the predictive accuracy of a model
by using the training sets for training the model and making the predictions against
the test sets. Usually, around 30 percent (default value) of the source data is held out
(meaning not included in the training set). You can choose to have this configured auto-
matically, to configure it via the Data Mining Wizard, or to configure it programmati-
cally using DMX, Analysis Management Objects (AMO), or XML DDL. Figure 13-3 shows
the page in the Data Mining Wizard where you can configure a portion of your source
data to be created as a testing set.

Figure 13-3 The BIDS Data Mining Wizard allows you to easily create testing datasets.

There are two ways you can specify a testing set using this wizard. You can accept (or
change) the default percentage value or you can fill in a value for a maximum number of
cases in the testing dataset. If you fill in both values, the value that produces the smaller
result is used to create the actual test set. You can alternatively configure the values used
to produce test sets in the properties of the data mining structure. These properties are
HoldoutMaxCases, HoldoutMaxPercent, and HoldoutSeed. The last property mentioned is
used to configure comparable size test sets when multiple models are created using the
same data source. Note that the Microsoft Time Series algorithm does not support automatic
creation of test sets.

This is a welcome new feature, particularly because it works not only with simple (single table
or dimension) sources, but also with structures that contain nested tables.
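
If you are scripting rather than using the wizard, the holdout can also be declared directly in DMX.
This is a minimal sketch with hypothetical column names; the REPEATABLE seed plays the same role
as the HoldoutSeed property mentioned above:

CREATE MINING STRUCTURE [Targeted Mailing With Holdout]
(
    [Customer Key] LONG KEY,
    [Age] LONG CONTINUOUS,
    [Commute Distance] TEXT DISCRETE,
    [Bike Buyer] LONG DISCRETE
)
WITH HOLDOUT (30 PERCENT OR 10000 CASES) REPEATABLE(42)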

You can just click through and name your new data mining structure. If you make a selection
in the wizard that results in an error during mining structure processing, the wizard prompts
you with an informational dialog box. The error messages are specific and thoughtfully writ-
ten. This is yet another way that Microsoft has made it easier for you to successfully build
mining structures, even if you are a novice.

The flexibility of the Data Mining Wizard and the resulting data mining structure is quite con-
venient. As we’ve mentioned repeatedly, it’s quite common to adjust content types, add or
remove attributes from a structure, and otherwise tune and tinker as the data mining process
continues. Of course, if you attempt to create either an invalid structure or model, you can
simply click the Back button in the wizard, fix the error, and then attempt to save the new
object again.

The next step in the process after creating a data mining structure is to add one or more
mining models to it. We’ll do that next.

Adding Data Mining Models using BIDS


After you’ve created a data mining structure, you’ll want to add one or more mining models
to it. The steps to do this are quite trivial, as long as you thoroughly understand which data
mining algorithm to select and what the required structure of the source data is for the algo-
rithm. You can right-click inside the designer surface in BIDS when the Mining Structure or
Mining Models tab is selected and then click New Mining Model. You simply provide a model
name and select the algorithm. BIDS then adds this model to your mining structure.

After you’ve added a mining model based on a particular algorithm, you might want to mod-
ify the usage of the attributes. The values are auto-generated by BIDS during data mining
model creation, but you can easily change them by selecting a new value in the drop-down
list of the Mining Model designer. The values you can select from are Ignore, Input, Key,
Predict, or PredictOnly.

You might also want to configure some of the properties that are common to all types of
data mining models or some of the algorithm parameters to further customize your mining
model. We covered many of the algorithm-specific parameters algorithm by algorithm in the
previous chapter. Common properties for mining models include the object name, whether
or not drillthrough is allowed, and the collation settings. You access these properties by right-
clicking the mining model on the Mining Models tab in BIDS and then clicking Properties.
This opens the Properties window.

In addition to configuring properties and parameters, you might want to reduce the size of
your model by specifying a filter. This is a new feature introduced in SQL Server 2008. You
can do this on the Mining Models tab by right-clicking the model you want to filter and then
clicking Set Model Filter. This opens the dialog box shown in Figure 13-4. There you can
specify a query to filter (or partition) your mining source data. An example of a business use
of this feature is to segment source data based on age groupings.

Figure 13-4 New in SQL Server 2008 is the ability to filter source data for data mining models using BIDS.

Using filters allows you to create smaller mining models. These smaller models can be pro-
cessed more quickly and can produce understandable and meaningful results more quickly
for your business situation. You can also configure filters by using the Properties window for
your mining model.
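
Filters can also be declared in DMX when a model is added to its structure. The following sketch
reuses the hypothetical structure from the earlier holdout example and segments the model to
customers over 40:

ALTER MINING STRUCTURE [Targeted Mailing With Holdout]
ADD MINING MODEL [TM Trees Over 40]
(
    [Customer Key],
    [Age],
    [Commute Distance],
    [Bike Buyer] PREDICT
)
USING Microsoft_Decision_Trees
WITH DRILLTHROUGH, FILTER([Age] > 40)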

Although creating most types of mining models is straightforward, some specific types of
source data require particular considerations. One example of this situation occurs if you use
multidimensional data as source data rather than using relational tables as source data. If you
select From Existing Cube as your data source for your mining structure, you are presented
with one additional page before you complete the Data Mining Wizard. This page allows
you to define a particular slice (or subset) of the cube as the basis for your mining structure.
Figure 13-5 shows an example of defining a slice in this way after selecting the Show Only
Customers Who Are Home Owners option in the Data Mining Wizard.

Figure 13-5 The Data Mining Wizard allows you to define slices of the source cube.

Another example of a requirement specific to a particular source type is when you’re creating
a model based on an existing cube that contains a time series using the Data Mining Wizard.
If this is your situation, you need to slice the time dimension to remove all members that fall
outside a certain range of time (assuming those members have been loaded into your cube
already).

Still another difference in the wizard if you base your mining model on an existing cube
(rather than relational source data) is that on the final page, shown in Figure 13-6, you’re
presented with two additional options: Create Mining Model Dimension and Create Cube
Using Mining Model Dimension. If you select either of these last two options, you also need
to confirm the default name for the new dimension or cube, or you can update that value to
name the new cube or dimension as your business requirements dictate. The default naming
conventions are <DimensionName>_DMDim and <CubeName>_DM.

Figure 13-6 The Data Mining Wizard allows you to create a new dimension in your existing cube using the
information in your mining model, or you can create an entirely new cube.

One interesting aspect of creating a new dimension, whether in the existing cube or in a
new cube, is that a new data source view (DSV) is created representing the information in
the model. Unlike most typical DSVs, this one cannot be edited, nor can the source data (the
model output) be viewed via the DSV interface.

Processing Mining Models


After you’ve created a data mining structure that includes one or more data mining models,
you’ll want to use the included data mining viewers to look at the results of your work. Before
you can do this, however, you have to process the data mining objects. As with multidimen-
sional structure (OLAP cube) processing, your options for processing depend first on whether
you’re working in connected or disconnected mode. If you’re building your initial structures
and models and are working in disconnected mode, you’ll be presented with the following
options for creating your objects: Process, Build, Rebuild, and Deploy. These options function
nearly the same way as they do when you’re using them to work with OLAP cubes. That is,
Build (or Rebuild) simply validates the XMLA that you’ve created using the visual design tools.

There is one difference between data mining models and OLAP cubes when using Build,
though—that is, there are no AMO design warnings built in for the former. If you’ve created
a structure that violates rules, such as a case table with no key column, you receive an error
when you attempt to build it. This violation is shown using a red squiggly line and the build
will be unsuccessful.

After you’ve successfully built your data mining objects, you can process them. This copies all
XMLA to the SSAS server that will create the data mining structures. After this step is com-
plete, the data mining objects are populated with source data from the data source defined
in the project. Process progress is shown in a dialog box of the same name. An example is
shown in Figure 13-7.

Figure 13-7 The Process Progress dialog box shows you step-by-step process activities for data mining
structures or models.

Again, as when you’re building OLAP cubes, during the early development phases, you’ll tend
to build, process, and deploy data mining objects using the default processing setting. This
setting, Process Full, completely erases all existing data mining objects with the same name
and completely rebuilds them and repopulates them with source data. Although this tech-
nique is quick and easy during development, as you move your mining models into produc-
tion, you’ll want to use more granular methods to process models, particularly when you’re
simply adding new data to an existing mining model. Processing methods for the structure
are full, default, structure, clear structure, and unprocess. Processing methods for the model
are full, default, and unprocess. We’ll cover the methods available to you to do such granular
processing later in this chapter.

After you’ve processed and deployed your mining models to the server, you can use the
Mining Model Viewer tab to look at your mining model results. We spent quite a bit of time
reviewing the various mining model viewers in the previous chapter, and we’re not going to
repeat that information here—other than to remind you that we use the mining model view-
ers as a first step in validating our data mining models.

Another consideration regarding data mining object processing is error handling. You can
configure several error-handling properties for each data mining structure by using the avail-
able settings in the Properties dialog box. To make these properties visible, you must change
the ErrorConfiguration value from (default) to (custom). Figure 13-8 shows these values. They
include many configurable properties relating to the handling of key and null errors. You’ll
recognize most of these from the dimension and fact custom error-handling configuration
possibilities, because, in fact, these properties are nearly the same. Also, we offer the same
advice here as we did when working with data destined for OLAP cubes—that is, refrain from
loading nulls and key errors by performing thorough ETL processes prior to attempting to
load your data mining models.

FIGURE 13-8 Custom error handling can be defined for each mining structure.

After you move to production, you might want to automate the processing of your min-
ing models or structures. As with OLAP object processing, you can easily accomplish this by
creating an SSIS package and choosing the Analysis Services Processing Task option. We’ll
take a closer look at the integration between SSIS and SSAS data mining later in this chapter.

Now, however, we’re going to return to the logical next step in the data mining system devel-
opment life cycle—that is, model validation. To accomplish this task, you can start by using
the tools included in BIDS. We’ll start by exploring the Mining Accuracy Chart tab in BIDS.

Note In addition to using XMLA as a definition language for data mining objects, there is
another data definition language available for creating some types of data mining objects. This
language is called Predictive Model Markup Language (PMML). It’s a standard developed by a
working group of data mining vendors, including Microsoft. See SQL Server Books Online to
understand which types of algorithms are supported.

Validating Mining Models


Although some of our clients have validated the usefulness of their data mining models sim-
ply by using the viewers available on the Mining Model Viewer tab to help them understand
the results of the mining model algorithm, most consumers of data mining projects also want
to include an accuracy check of the models that are produced. This is done to validate the
output produced and to guide further customization of a particular model—or even to help
determine which algorithms shed the most light on the particular business situations being
addressed. You should be aware that in the CRISP-DM software development life cycle, there
is a distinct phase of the software development life cycle dedicated to model evaluation
because of the very large number of variables involved in successful data mining. These vari-
ables include the choice of data for inclusion, content types, inputs, predictability, choice of
mining algorithm, and so on. We’ve mentioned this idea several times previously, but it bears
repeating here. We find this inexactitude to be challenging for developers new to data min-
ing to understand, accept, and work with. Because of the wide variety of choices involved in
data mining, model validation is a key phase in all but the simplest data mining projects.

Let’s take a look at the types of tools that are included in BIDS to assist with model valida-
tion. Each is represented by an embedded tab in the Mining Accuracy Chart tab. They are
the lift (or profit) chart, classification matrix, and the cross validation tool. Also on the Mining
Accuracy Chart tab is the preparatory Input Selection tab. This interface is shown in Figure
13-9. When you configure the test dataset and click on any of the accuracy chart tabs, a DMX
query is generated and the results are shown in graphical or tabular form. For example, in
the case of a lift chart, the DMX query begins with CALL System.Microsoft.AnalysisServices.
System.DataMining.AllOther.GenerateLiftTableUsingDataSource.

To use any of these validator tools, you must first configure the input. You do this by select-
ing the mining model or models from the selected structure that you want to analyze, and
then selecting a predictable column name and (optionally) a predictive value setting. The
next portion of the input configuration involves selecting what the source of the test data
will be. Test data contains the right answer (or correct value for the predictions) and is used
to score the model during accuracy analysis. The default setting is to use the test dataset
that you (optionally, automatically) created at the time of data mining structure creation. At
the bottom of the Input Selection tab, you can change that default to use any dataset that
is accessible through a data source view in your BIDS project as testing data. You can also
optionally define a filter for manually selected testing datasets. After you’ve configured input
values, you click on the Lift Chart tab to create a lift chart validation of your selected model
or models.

FIGURE 13-9 The Input Selection tab is used to select the test data to use when verifying model accuracy.

To further explain the outputs of the Mining Accuracy Chart tab, you need to understand a
bit more about lift charts and profit charts.

Lift Charts
A lift chart compares the accuracy either of all predictions or of a specific prediction value
of each model included in your structure to that of an average guess and also to that of a
perfect prediction. Lift chart functionality is included in BIDS. The output of lift charts is a
line chart. This chart includes a line for average guessing, or 50 percent correct values, and a
line for ideal results or 100 percent correct results. The average guess line bisects the chart’s
center; the ideal line runs across the top of the chart. Lift charts compare holdout data with
mining model data. Holdout data is data that contains all the attributes included in the min-
ing model data, but, importantly, it also contains the predicted value. For example, if a time
series algorithm was predicting the rates of sale of particular models of a bicycle over a range
of time, the holdout data would contain actual values for rates of sale for a subset of that
same data. Another way to think of a lift chart is that it performs mining model accuracy
validation.

Useful data mining models are scored somewhere in the range above random guessing and
below perfection or ideal. The closer the model scores to the ideal line, the more effective it
is. Lift charts can be used to validate a single mining model. They are also useful to validate
multiple models at the same time. These models can be based on different algorithms, the
same algorithm with varied inputs (such as content types, attributes, filters, and so on), or a
combination of these techniques.

To show the functionality of a lift chart, we’ll work with the Targeted Mailing sample data min-
ing structure included in the Adventure Works sample. To start, open the mining structure
in BIDS and then select all included mining models as input to the lift chart. Also, accept the
default, which is to use the holdout data that was captured when the model was first trained.
You’ll have to select a value for the predicted attribute. This is called PredictValue on the
Input Selection tab. For our sample, set the value to 1, which means customers who pur-
chased a bicycle.

Note If your sample model, Targeted Mailing, does not include holdout data, called mining
model test cases or mining structure test cases on the Input Selection tab, select the Specify A
Different Data Set option on that tab and then use the sample vTargetMail view included in
Adventure Works DW 2008 as the holdout data.

After you click on the second tab, Lift Chart, in BIDS, you’ll be able to review the output of
the lift chart showing which customers will be most likely to purchase bicycles. You might use
lift chart output to help you to determine which mining models are most accurate in pre-
dicting this value, and you might then base business decisions on the most accurate mining
model. A common scenario is targeting mailings to potential future customers—for example,
determining which model or models most accurately associates attributes with bike-buying
behavior. The lift chart output will help you to select those models and to take action based
on the output.

A lift chart can be used to validate the accuracy of multiple models, each of which predicts
discrete values. A random (or average) guess value is shown by a blue line in the middle of
the chart; the perfect (or ideal) value is shown as a line at the top of the chart. The random
guess is always equal to a 50 percent probability of correctness of results. The perfect value is
always equal to 100 percent correct values. As shown in Figure 13-10, the random line always
measures at 50 percent of the perfect value line. You might wonder where the percentage
originates from. This is the result of a DMX query that SSAS automatically generates after
you configure the Input Selection tab of the Mining Accuracy Chart tab in BIDS. The (hold-
out) testing data is compared to the trained mining model data. If you want to see the exact
query that SSAS generates, you can just run SQL Server Profiler to capture the DMX.

This data contains the correct predicted value for the dataset—using our example, customers
who actually purchased a bicycle. It’s important that you understand that a lift chart shows
the probability of a prediction state and that it does not necessarily show the correctness of
the prediction state or value. For example, it could validate that a model predicts that some-
one either would not buy a bicycle or that they would buy a bicycle. The chart predicts the
validity of only those attribute state or states that you configure it to predict.

FIGURE 13-10 A lift chart allows you to validate multiple mining models.

The output format of the lift chart you receive depends on the content types in the model
or models used for input. If your model is based on continuous predictable attributes, the lift
chart output is a scatter plot rather than a straight-line chart. Also, if your mining model is
based on the Time Series algorithm, you must use an alternative method (namely, a manual
DMX prediction query) rather than a lift chart to validate it. Another possibility is that you
want to use a lift chart to validate a mining model that does not contain a predictive attri-
bute. In this case, you get a straight-line chart. It simply shows the probability of correct
selections for all possible values of the attribute—that is, both positive and negative values in
the case of a discrete attribute.

The lift chart output includes a mining legend as well. There is one row for each mining
model evaluated, as well as one row for the random guess and the best score (called an
Ideal Model). There are three columns in this legend: Score, Target Population, and Predict
Probability. The Score column shows the comparative merit for all the included models—
higher numbers are preferred.

For this example, the lift chart is showing that the mining model using the Microsoft Decision
Trees algorithm is producing the highest score and is, therefore, the best predictor of the
targeted value—that is, customers who will actually purchase a bicycle—of all of the mining
models being compared.

The Target Population column shows how much of the population would be correctly pre-
dicted at the value selected on the chart—that is, “purchased bicycle” or “did not purchase
bicycle.” That value is indicated by the gray vertical line in Figure 13-10, which is set at the
random guess value, or 50 percent of the population. In the same figure, you can see that
for the TM Decision Tree model, the target population captured (at around 50 percent of
the total population) is around 72 percent, which would, of course, be around 36 percent of
the entire possible population. The Predict Probability column shows the probability score
needed for each prediction to capture the shown target population. Another way to think
about the last column mentioned is in terms of selectivity. Using our example, the predict
probability values are quite similar: 50 percent and 43 percent, respectively. But what if they
were 40 percent and 70 percent?

If that were the case, you’d have to decide whether you’d be willing to send promotional mail to
potential future bicycle buyers even though over one-half of the recipients (that is, 60 percent)
would not respond, in order to capture 72 percent of likely buyers, or whether you’d rather accept
fewer wasted mailings (roughly one-third in the second model) in exchange for a lower return
(around 60 percent). Suffice it to say that the three values of the legend are used together to
make the best business decision.

Profit Charts
A profit chart displays the hypothetical increase in profit that is associated with using each
model that you’ve chosen to validate. As mentioned, lift charts and profit charts can be used
with all included algorithms except the Microsoft Time Series algorithm. You can view a profit
chart by selecting it from the Chart Type drop-down list above the charts. You configure a
profit chart by clicking the Profit Chart Settings button on the Lift Chart tab. This opens the
dialog box shown in Figure 13-11.

FIGURE 13-11 The Profit Chart Settings dialog box allows you to configure values to determine profit
thresholds.

The configurable values are as follows:

■■ Population Total number of cases used to create the profit chart


■■ Fixed Cost General costs associated with this particular business problem (for exam-
ple, cost of machinery and supplies)

■■ Individual Cost Individual item cost (for example, cost of printing and mailing each
catalog)
■■ Revenue Per Individual Estimated individual revenue from each new customer

After you’ve completed entering these configuration values, BIDS produces a profit chart
using a DMX query, just as is done when you create a lift chart using BIDS.

The profit chart output contains both graphical and legend information. The graph shows
the projected profit for the population based on the models you’ve included. You can also
use the mining legend to help you evaluate the effectiveness of your models. The legend
includes a row for each model. For the selected value on the chart, the profit chart legend
shows what the projected profit amount is. It also shows the predict probability for that profit
value. Higher numbers are more desirable here.

You can see in Figure 13-12 that the data mining model that is forecasted to produce the
largest profit from the Targeted Mailing data mining structure is the decision tree model.
Specifically, at the value we’re examining (around 52 percent of the population), the projected
profit for the decision tree model is greater than for any other mining model included in this
assessment. The value projected is around $190,000.

FIGURE 13-12 Profit charts show you profit thresholds for one or more mining models.

As with the lift chart output, you should also consider the value of the predict probability.
You’ll note that in our example this value is around 50 percent for decision trees. It varies
from a low of 42 percent for clustering to a high of 50 percent for decision trees. You’ll recall
that this value tells you how likely it is that the first value (that is, the projected profit) will occur.

Classification Matrix
For many types of model validation, you’ll be satisfied with the results of lift and profit charts.
In some situations, however, you’ll want even more detailed validation output. For example,
when the cost of making an incorrect decision is high—say you’re selling timeshare proper-
ties and offering potential buyers an expensive preview trip to a destination property—you’ll
need the detailed validation found in the classification matrix (sometimes called the confu-
sion matrix in data mining literature).

This matrix is designed to work with discrete predictable attributes only. It displays tabular
results that show the predicted values and the actual values for one or more predictable
attributes. These results are sometimes grouped into the following four categories: false posi-
tive, true positive, false negative, and true negative. So it’s much more specific in functionality
than either the lift chart or profit chart. It reports exact numbers for the four situations. As
with lift charts, the testing data you configured on the Input Selection tab is used as a basis
for producing results for the classification matrix.

The output of this validator is a matrix or table for each mining model validated. Each matrix
shows the predicted values for the model (on rows) and the actual values (on columns).
Looking at this table, you can see exactly how many times the model made an accurate pre-
diction. The classification matrix analyzes the cases included in the model according to the
value that was predicted, and it shows whether that value matches the actual value.

Figure 13-13 shows the results for the first mining model in the group we’re analyzing: the
decision tree model. The first result cell, which contains the value 6434, indicates the number
of true positives for the value 0 (zero). Because 0 indicates the customer did not purchase
a bike, this statistic tells you that model predicted the correct value for members of the
data population who did not buy bikes in 6434 cases. The cell directly underneath that one,
which contains the value 2918, tells you the number of false positives, or how many times
the model predicted that someone would buy a bike when actually she did not. The cell that
contains the value 2199 indicates the number of false negatives for the value 1. Because 1
means that the customer did purchase a bike, this statistic tells you that in 2199 cases, the
model predicted someone would not buy a bike when in fact he did. Finally, the cell that
contains the value 6933 indicates the number of true positives for the target value of 1. In
other words, in 6933 cases the model correctly predicted that someone would buy a bike.

By summing the values in cells that are diagonally adjacent, you can determine the overall
accuracy of the model. One diagonal tells you the total number of accurate predictions, and
the other diagonal tells you the total number of erroneous predictions. So to continue our
example, for the decision tree model the correct predictions are as follows: 6434 + 6933 =
13367. And the incorrect predictions are expressed in this way: 2918 + 2199 = 5117. You can
repeat this process to determine which model is predicting most accurately.
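
Expressed as an overall accuracy rate, the decision tree model in this example classifies correctly
in 13367 / (13367 + 5117), or roughly 72 percent, of the cases in the testing data.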

FIGURE 13-13 A classification matrix shows your actual and false positive (and negative) prediction values.

So, to evaluate across the models included in the Targeted Mailing sample structure, you
might want to produce a summary table that looks like the one in Table 13-1.

TABLE 13-1 Sample Results for Targeted Mailing Structure


Type of Model Correct Incorrect
TM Decision Tree 13367 5117
TM Clustering 11151 7333
TM Naïve Bayes 11671 6813
TM Neural Net 12237 6247

You can also use the classification matrix to view the results of predicted values that have
more than two possible states. To facilitate this activity, SQL Server Books Online recommends
using the button available on the Classification Matrix tab in BIDS to copy the values to the
Clipboard and then pasting that data into Excel (as we did to produce Table 13-1).

Cross Validation
In addition to the three validation tools we’ve covered, SQL Server 2008 introduces a new
type of validation tool. It works a bit differently than the existing tools. The cross validation
tool was added specifically to address requests from enterprise customers. Keep in mind that
cross validation does not require separate training and testing datasets. You can use testing
data, but you won’t always need to. This elimination of the need for holdout (testing) data
can make cross validation more convenient to use for data mining model validation.

Cross validation works by automatically separating the source data into partitions of equal
size. It then performs iterative testing against each of the partitions and shows the results in
a detailed output grid. An output sample is shown in Figure 13-14. Cross validation works
according to the value specified in the Fold Count parameter on the Cross Validation tab
of the Mining Accuracy Chart tab in BIDS. The default value for this parameter is 10, which
equates to 10 sets. If you’re using temporary mining models to cross validate in Excel 2007,
10 is the maximum number of allowable folds. If you’re using BIDS, the maximum number
is 256. Of course, a greater number of folds equates to more processing overhead. For our
sample, we used the following values: Fold Count of 4, Max Cases of 100, Target Attribute
of Bike Buyer, and Target State of 1. Cross validation is quite computationally intensive, and
we’ve found that we often need to adjust the input parameters to achieve an appropriate
balance between performance overhead and results.

FIGURE 13-14 The new cross validation capability provides sophisticated model validation.

You can also implement cross validation using newly introduced stored procedures. See the
SQL Server Books Online topic “Cross Validation (Analysis Services – Data Mining)” for more
detail.
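
For example, the SystemGetCrossValidationResults system stored procedure can be called from a
DMX query window in SSMS. The following sketch assumes the Targeted Mailing structure and the
parameter values used in our sample (a fold count of 4, a maximum of 100 cases, a target attribute
of Bike Buyer, and a target state of 1); confirm the exact parameter list against the Books Online
topic just mentioned before relying on it:

CALL SystemGetCrossValidationResults(
    [Targeted Mailing],     -- mining structure
    [TM Decision Tree],     -- mining model(s) to validate
    4,                      -- fold count
    100,                    -- maximum number of cases
    'Bike Buyer',           -- target attribute
    1                       -- target state
)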

Note that the output shown contains information similar to that displayed by the classifica-
tion matrix—that is, true positive, false positive, and so on. A reason to use the new cross-val-
idation capability is that it’s a quick way to perform validation using multiple mining models
as source inputs.

Note Cross validation cannot be used to validate models built using the Time Series or
Sequence Clustering algorithms. This is logical if you think about it: both of these algorithms
depend on ordered sequences, and if the data were partitioned for testing, the validity of the
sequences would be violated.

After you’ve validated your model or models, you might want to go back, make some
changes to improve validity, and then revalidate. As mentioned, such changes can include
building new models (using different algorithms), changing algorithm parameter values, add-
ing or removing source columns, reconfiguring attribute values, and more. We find model
building to be an iterative process. A couple of best practices are important to remember
here, however:

■■ The cleaner the source data is, the better. In the world of data mining, there are already
many variables. One variable that can diminish the value of the results is messy or dirty
data.
■■ Algorithms are built for specific purposes; use the correct tool for the job. Some algo-
rithms are more easily understood than others. For example, it’s obvious what you’d
use the Time Series algorithm for. It might be less obvious, at least at first, when to use
Naïve Bayes and Clustering.
■■ Stick with parameter default values when you’re first starting. Again, there are many
variables, and until you have more experience with both data mining capabilities and
the source data, reduce the number of variables.

Again, at this point, you might be done and ready to deploy your mining models (as reports)
for end users to work with and to use as a basis for decision support. There is an additional
capability that some of you will want to use in your data mining projects: the ability to query
your mining models. Specifically, you can use your model as a basis for making predictions
against new data. This is accomplished using DMX prediction-type queries. As with the
validation process, BIDS provides you with a graphical user interface, called Mining Model
Prediction, that allows you to visually develop DMX prediction queries. We’ll explore that in
the next section.

Data Mining Prediction Queries


Data mining prediction queries can be understood as conceptually similar to relational data-
base queries. That is, this type of query can be performed ad hoc (or on demand) using the
GUI query-generation tools included in BIDS (or SSMS), or by writing the DMX query state-
ment by hand. A more interesting use of DMX prediction queries is, however, in application
integration. Just as database queries, whether relational or multidimensional, can be inte-
grated into client applications, so too can mining model prediction queries.

An example of a business use of such a capability is assessing an applicant’s viability for a bank
loan. The SQL Server
2008 Data Mining Add-ins for Office 2007 introduces a Prediction Calculator to the Table
Tools Analyze tab on the Excel 2007 Ribbon. We’ll cover this capability in more detail in
Chapter 25, “SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007.”
The idea is that new input data can be used as a basis for the existing mining model to pre-
dict a value. We believe that custom application integration provides a significant oppor-
tunity for application developers to harness the power of SSAS data mining in their own
applications.

First, of course, we’ll need to review the mechanics of how to perform DMX prediction que-
ries. Two types of queries are supported in the BIDS interface: batch and singleton. Singleton
queries take one single row of data as input. Batch queries take a dataset as input. There are
three viewers on the Mining Model Prediction tab. They are the visual DMX query builder,
the SQL (native DMX statement) query builder, and the Results view.

The first step in creating a query is to select an input table. (Batch query is the default.) If you
want to use a singleton query, click the third button from the left on the toolbar. After doing
so, the Input Table dialog box changes to a single input dialog box. For the purposes of our
discussion, we’re using the sample Target Mailing data mining structure and the TM Decision
Tree data mining model. We’ve configured vTargetMail as the input table. Of course, this
input view is the same one we used to build the mining model. For a real-world query, you
use new source data.

Figure 13-15 shows that the column name mappings are detected automatically in the query
interface. If you want to view, alter, or remove any of these mappings, just right-click on
the designer surface and then click Modify Connections. You’ll be presented with a window
where you can view, alter, or remove connections between source and destination query
attributes (columns) for batch queries.

FIGURE 13-15 The Mining Model Prediction tab (top portion) helps you build DMX predict queries visually by
allowing you to configure input values.

After you’ve configured the source and new input values, the next step in building your DMX
prediction query is to configure the query itself. Using the BIDS designer, you can either type
the DMX into the query window or you can use the guided designer. If using the guided
designer, you’ll first configure the Source column. There you can select the mining model, the
input, a prediction function, or a custom expression. This is shown in Figure 13-16. You’ll most
commonly select a prediction function as a source. We’ll talk a bit more about the prediction
functions that are available in DMX a bit later in this chapter.

FIGURE 13-16 The Mining Model Prediction tab (bottom portion) helps you build DMX predict queries visu-
ally by allowing you to select functions or to write criteria.

Next you configure the Field value. If you choose a prediction function as a source, you’ll
select from the available prediction functions to configure the Field text box. After you select
a particular prediction function, the Criteria/Argument text box is filled in with a template
showing you the required and optional arguments for the function you’ve selected. For
example, if you select the Predict function, the Criteria text box contains <Scalar column
reference>[, EXCLUDE_NULL|INCLUDE_NULL][, INCLUDE_NODE_ID]. This indicates that the
scalar column reference is a required argument (denoted by the angle brackets) and that the
other arguments are optional (denoted by the square brackets). For our example, we’ll simply
replace the column reference with the [Bike Buyer] column. You can see the DMX that has
been generated by clicking the Query (SQL) button at the left end of the toolbar. For our
example, the generated DMX is as follows:

SELECT
Predict([Bike Buyer])
From
[TM Decision Tree]
PREDICTION JOIN
OPENQUERY([Adventure Works DW],
  'SELECT
    [MaritalStatus],
    [Gender],
    [YearlyIncome],
    [TotalChildren],
    [NumberChildrenAtHome],
    [HouseOwnerFlag],
    [NumberCarsOwned],
    [CommuteDistance],
    [Region],
    [Age],
    [BikeBuyer]
  FROM
    [dbo].[vTargetMail]') AS t
ON
[TM Decision Tree].[Marital Status] = t.[MaritalStatus] AND
[TM Decision Tree].[Gender] = t.[Gender] AND
[TM Decision Tree].[Yearly Income] = t.[YearlyIncome] AND
[TM Decision Tree].[Total Children] = t.[TotalChildren] AND
[TM Decision Tree].[Number Children At Home] = t.[NumberChildrenAtHome] AND
[TM Decision Tree].[House Owner Flag] = t.[HouseOwnerFlag] AND
[TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned] AND
[TM Decision Tree].[Commute Distance] = t.[CommuteDistance] AND
[TM Decision Tree].[Region] = t.[Region] AND
[TM Decision Tree].[Age] = t.[Age] AND
[TM Decision Tree].[Bike Buyer] = t.[BikeBuyer]

To execute the query, just click the Result button at the left end of the toolbar. To help you
understand the capabilities available when using prediction queries, we’ll take a closer look at
the prediction functions. See the Books Online topic “Data Mining Extensions (DMX) Function
Reference” for more information. To explore DMX, we’ll open SSMS to look at the DMX pre-
diction template queries that are included as part of all SSAS templates. Let’s take a look at
both of those next.

DMX Prediction Queries


The Data Mining Extensions language is modeled after the SQL query language. You prob-
ably recognized the SQL-like syntax of SELECT…FROM…JOIN…ON from the code generated in
the preceding example. Note the use of the DMX Predict function and the PREDICTION JOIN
keyword. We’ll start by looking at both of these commonly used elements.

The syntax for PREDICTION JOIN is much like the syntax for the Transact-SQL JOIN keyword.
The difference is that, for the former, the source table is a data mining model and the joined
table is the new input data. The connection to the new input data is made either using
the SQL keyword OPENROWSET or OPENQUERY. OPENROWSET requires the connection
credentials to be included in the query string, so OPENQUERY is the preferred way to con-
nect. OPENQUERY uses a data source that is defined in SSAS. There is one variation for the
PREDICTION JOIN syntax. It’s to use NATURAL PREDICTION JOIN. You can use this when the
source columns are identical to the attribute (column) names in the mining model. NATURAL
PREDICTION JOIN is also commonly used when you’re performing a singleton (that is, single
row as input) prediction query.
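
To make the shape of such a query concrete, the following sketch runs a singleton prediction
against the TM Decision Tree model used throughout this chapter. The attribute values supplied in
the inner SELECT are illustrative only and must correspond to states that actually exist in your
trained model:

SELECT
  Predict([Bike Buyer]) AS [Predicted Bike Buyer],
  PredictProbability([Bike Buyer]) AS [Prediction Probability]
FROM
  [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT
  35 AS [Age],
  'M' AS [Gender],
  'S' AS [Marital Status],
  2 AS [Number Cars Owned],
  '0-1 Miles' AS [Commute Distance]) AS t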

By now, you might be wondering what common query scenarios occur in the world of data
mining. In the world of application integration, there are generally three types of queries:
content, prediction, and prediction join. Content queries simply return particular data or
metadata from a processed mining model. Prediction queries can be used to execute pre-
dictions, such as forward steps in a time series. Prediction join queries are a special type of
prediction query in that they use data from a processed model to predict values in a new
dataset.
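
As a simple illustration of the first category, the following content query (a sketch that again
assumes the TM Decision Tree model) returns node-level metadata from the processed model:

-- Return selected metadata for each node in the trained model
SELECT
  NODE_CAPTION,
  NODE_TYPE,
  NODE_SUPPORT
FROM
  [TM Decision Tree].CONTENT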

Note There are other types of DMX queries related to model administration. These include
structure, model content, and model management. You might also remember that both PMML
and XMLA are involved in data mining object management. These topics aren’t really in the
scope of our discussion on DMX for developers.

To take a look at the included DMX templates in SSMS, connect to SSAS in SSMS, display the
Template Explorer, and then click on the DMX node to view and work with the included tem-
plates. Figure 13-17 shows the included DMX templates. Note that the templates are catego-
rized by language (DMX, MDX, and XMLA) and then by usage—that is, by Model Content,
and so on.

We’ve dragged a couple of the prediction queries to the query design window so that we can
discuss the query structure for each of them. In the first example, Base Prediction, you can
clearly see the familiar SQL-like keyword syntax of SELECT…FROM…(PREDICTION) JOIN…ON…
WHERE. Note that all DMX keywords are colored blue, just as keywords in Transact-SQL are.
As mentioned previously, values between angle brackets represent replaceable parameters
for required arguments and values between square brackets represent optional arguments.
The second example, Nested Prediction, shows the syntax for this type of query. Nested
predictions use more than one relational table as source data. They require that the data in
the source tables has a relationship and is modeled as a CASE table and as a NESTED table.
For more information, see the Books Online topic “Nested Tables (Analysis Services—Data
Mining)” at http://msdn.microsoft.com/en-us/library/ms175659.aspx. Note the use of the fol-
lowing keywords to create the nested hierarchy: SHAPE, APPEND, and RELATE.

FIGURE 13-17 The Template Explorer in SSMS provides you with three different types of DMX queries.

The last code example shown in Figure 13-17 shows the syntax for a singleton prediction
query. Here you’ll note the use of the syntax NATURAL PREDICTION JOIN. This can be used
because the input column names are identical to those of the mining model.

DMX Prediction Functions


DMX contains many prediction functions. SQL Server Books Online provides a good descrip-
tion of each of them. See the topic “Data Mining Extensions (DMX) Function Reference.”

The core function in the DMX prediction library is the Predict function. It can return either a
single, specific value (scalar) or a set of values (table).

Although there are many specific prediction functions that are designed to work with spe-
cific algorithms, the query syntax is somewhat simplified by the fact that the Predict function
itself supports polymorphism. What this means is that you can often simply use the Predict
function rather than a more specific type of prediction, such as PredictAssociation, and SSAS
automatically uses the appropriate type of prediction function for the particular algorithm
(model) being queried.

The Predict function includes several options. The availability of these options is dependent
on the return type specified—that is, scalar or table. For tabular results, you can specify these
options:

■■ Inclusive Includes everything in the results


■■ Exclusive Excludes source data in the results
■■ Input_Only Includes only the input cases
■■ Include_Statistics Includes statistical details; returns as two additional columns,
$Probability and $Support
■■ Top count Returns the top n rows, with n specified in the argument

For scalar return types, you can specify either the Exclude_Null or Include_Null option.
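
To see how these options relate to the prediction functions, the following sketch (again assuming
the TM Decision Tree model, with purely illustrative input values) requests the scalar prediction
with the Include_Null option along with the full probability histogram for a single hypothetical
case:

SELECT
  Predict([Bike Buyer], INCLUDE_NULL) AS [Predicted Value],
  PredictHistogram([Bike Buyer]) AS [Value Histogram]
FROM
  [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT 44 AS [Age], '1' AS [House Owner Flag]) AS t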

You can write and test your prediction queries in BIDS, SSMS, or any other tool that sup-
ports DMX queries. Excel 2007 also executes DMX queries after it has been configured with
Data Mining Add-ins. If you want to see the exact syntax of a DMX query that some end-
user application runs, you can use SQL Server Profiler to capture the DMX query activity. We
described the steps to use SQL Server Profiler to do this in an earlier chapter.

Now that you understand the basic prediction query syntax and options, let’s take a closer
look at the prediction functions that are available as part of DMX. Table 13-2 lists the func-
tion, its return type, and a brief description.

Now that you have a better understanding of the available predictive functions in the DMX
language, we’ll return to the Mining Model Prediction tab in BIDS. We’ve switched the query
view from the default of batch input to singleton input by right-clicking on the designer
surface and then selecting Singleton Query from the shortcut menu. Next, we configured
the values for the singleton input by selecting values from the drop-down lists next to each
Mining Model Column entry in the Singleton Query Input pane. Following that, we config-
ured the query portion (bottom section) by setting the Source to a Prediction Function, then
selecting the Predict function in the Field area, and finally by dragging the Bike Buyer field
from the Mining Model pane to the Criteria/Argument area.

We then clicked the Query Mode button (first button on the toolbar) and then clicked the
Query View button in the drop-down button list. This results in the DMX query shown in
Figure 13-18. If you’d like to execute the query you’ve built, click the Query Mode button and
then click the Result button.
TABLE 13-2 DMX Prediction Functions
Function Returns Notes
Predict Scalar Core function
PredictSupport Scalar Count of cases that support predicted value
PredictVariance Scalar Variance distribution for which Predict is the mean (for
continuous attributes)
PredictStdev Scalar Square root of PredictVariance
PredictProbability Scalar Likelihood that the Predict value is correct
PredictProbabilityVar Scalar Certainty that the value of PredictVariance is accurate
PredictProbabilityStdev Scalar Square root of PredictProbabilityVar
PredictTimeSeries Table Predicted value of next n in a time series
Cluster Scalar ClusterID that the input case belongs to with highest probability
ClusterDistance Scalar Distance from predicted ClusterID
ClusterProbability Scalar Probability value of belonging to predicted Cluster
RangeMid Scalar Midpoint of predicted bucket (discretized columns)
RangeMin Scalar Low point of predicted bucket (discretized columns)
RangeMax Scalar High point of predicted bucket (discretized columns)
PredictHistogram Rowset Histogram of each possible value, with predictable column
and probability, expressed as $Support, $Variance, $Stdev,
$Probability, $Adjusted Probability, $Probability Variance, and
$Probability Stdev

FIGURE 13-18 Singleton queries can be built on the Mining Model Prediction tab in BIDS.

In addition to viewing the query in the Query mode, you can also edit the query directly by
typing the DMX code into the window.

The BIDS code window color-codes the DMX—for example, by coloring DMX keywords
blue—but unfortunately, no IntelliSense is built in to the native BIDS DMX code window. The
SSMS DMX query window functions similarly—that is, it has color-coded keywords but no
IntelliSense.

Note that in the query the type of prediction join configured for the singleton input is auto-
matically set to NATURAL PREDICTION JOIN. This is because the input column names are
identical to those of the data mining model to which the join will be performed.

As we complete our brief look at DMX, we remind you that we’ve found a number of ways
to implement DMX queries. These include ad hoc querying using BIDS or SSMS. We use this
method during the early phases of the development cycle. We also sometimes use manual
query execution to diagnose and troubleshoot key queries, such as those that might have
been generated by end-user client tools. We use SQL Server Profiler to capture the syntax of
the DMX query and then tinker with the query using the ad hoc query environments in either
BIDS or SSMS.

Another use of custom DMX queries is to embed them into custom applications. We think
this is an exciting opportunity area for application developers. The potential for integrating
data mining predictive analytics in custom applications is a new and wide-open area for most
companies. Just one interesting example of custom application integration is to use data
mining as a type of input validation in a form-based application. Rather than writing custom
application-verification code, you can use a trained data mining model as a basis for building
input validation logic. In this example, you send the input values from one or more text boxes
on the form as (singleton) inputs to a DMX query. The results can be coded against thresh-
olds to serve as indicators of valid or invalid text box entries. What is so interesting about
this use case is that as the source data changes (is improved), the input validation results also
improve. You can think of this as a kind of dynamic input validation.

Before leaving the subject of DMX, we’d be remiss if we didn’t talk a bit about the integra-
tion between SSAS data mining and SSIS. SSIS includes several types of integration with both
OLAP cubes and data mining models. In the next section, we’ll look specifically at the inte-
gration between data mining and SSIS.

Data Mining and Integration Services


We realize that if you’re reading this book in the order that it’s written, this section will be the
first detailed look you’ve taken at Integration Services. If you’re completely new to SSIS, you
might want to skip this section for now, read at least the introductory chapter on SSIS (which
is the next chapter in the book), and then return to this section.

Data mining by DMX query is supported as a built-in item in SSIS as long as you’re using the
Enterprise edition of SQL Server 2008. You have two different item types to select from. The
first type, Data Mining Query Task, is available in the control flow Toolbox. The editor for this
task is shown in Figure 13-19.

FIGURE 13-19 The Data Mining Query Task Editor dialog box allows you to associate a data mining query with
the control flow of an SSIS package.

The second type is available in the data flow Toolbox in the transformation section: it’s the
Data Mining Query data flow component. As with the control flow task, the data flow com-
ponent allows you to associate a data mining query with an SSIS package. The difference
between the task and the component is, of course, where they are used in your package—
one is for control flow, and the other is for data flow. Note that there are three tabs in the
dialog box: Mining Model, Query, and Output. On the Mining Model tab, you configure
the connection to the SSAS instance using a connection manager pointing to your Analysis
Services instance and then select the mining structure. You’ll see a list of mining models that
are included in the mining structure you’ve selected.

On the Query tab, you configure your DMX query either by directly typing the DMX into the
query window or by using the Build New Query button on the bottom right of the Query
tab of the Task Editor dialog box. Clicking this button opens the (now familiar) dialog box
you saw in BIDS when you were using the Mining Model Prediction tab. This tab allows you
to select a data mining model as a source for your query and then to select the source of
the query input, just as you did using BIDS. In our case, we selected the TM Decision Tree
model as the source and the vTargetMail view as the input. Then, as we did in the earlier
query example using the BIDS query interface, we continued our configuration by adding the
Predict function using the [Bike Buyer] column as our argument. Note that the DMX query
produced uses PREDICTION JOIN to perform the join between the source and new input
data. A sample DMX query is shown in the Data Mining Query Task Editor in Figure 13-20.

FIGURE 13-20 The data mining query task allows you to write, edit, and associate variables with
your DMX query.

The tabs within the Query section allow you to perform parameter mapping or to map the
result set using variables defined in the SSIS package. The last main tab in the Data Mining
Query Task Editor is the Output tab. This allows you to map the output of this task to a table
using a connection you define in the SSIS package.

If you use the Data Mining Query component in the data flow section, the DMX query-
building is accomplished in a similar way to using the Data Mining Query task in the control
flow section. The difference is that, for the component, you configure the input from the
incoming data flow rather than in the task itself. Also, you must connect the data flow task to
a data flow destination (rather than using an Output tab on the task itself) to configure the
output location for the results of your DMX query.

There are additional tasks available in SSIS that relate to SSAS data mining. However, they do
not relate to DMX querying. Rather, they relate to model processing. Before we explore these
tasks, let’s take a closer look at data mining structure and model processing in general.

Data Mining Object Processing


As we’ve seen, when you process a mining model using BIDS a dialog box showing you the
process progress is displayed. You might be interested to know that you can use a number
of methods or tools to accomplish data mining object processing. These include SSMS, SSIS,
Excel 2007, and XMLA script.

To understand what structure or model processing entails, let’s take a closer look at the
options available for processing. For mining structures, you have five options related to
processing:

■■ Process Full Completely erases all information, and reprocesses the structure, models,
and trains using source data.
■■ Process Default Performs the least disruptive processing function to return objects to
ready state.
■■ Process Structure Populates only the mining structure (but not any included mining
models) with source data.
■■ Process Clear Structure Removes all training data from a structure. This option is
often used during early development cycles, and it works only with structures, not with
models.
■■ Unprocess Drops the structure or model and all associated training data. Again, this
option is often used during early development cycles for rapid prototyping iterations.

As with OLAP object processing, during development cycles, it’s quite common to simply run
a full process every time you update any structures or models. As you move to deployment,
it’s more common to perform (and automate) a more granular type of processing. For data
mining objects, this is more commonly accomplished using the Process Default option.

So, what exactly happens during processing? Let’s look at the Process Progress dialog box,
shown in Figure 13-21, to increase our understanding. This dialog box looks just like the one
that is displayed when you process OLAP objects. The difference is in how the data mining
objects are processed and what the shape of the results is. SSAS data mining uses the SQL
INSERT INTO and OPENROWSET commands to load the data mining model with data during
the processing or training phase. If a nested structure is being loaded, the SHAPE command
is used as well.
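
Behind the scenes, the training statement that loads a simple (non-nested) structure resembles the
following sketch. The column list is abbreviated here, and the names assume the Targeted Mailing
structure and the Adventure Works DW data source used earlier:

INSERT INTO MINING STRUCTURE [Targeted Mailing]
  ([Customer Key], [Age], [Commute Distance], [Bike Buyer])
OPENQUERY([Adventure Works DW],
  'SELECT CustomerKey, Age, CommuteDistance, BikeBuyer
   FROM dbo.vTargetMail')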

If there are errors or warnings generated during processing, they are displayed in the
Process Progress dialog box as well. One interesting warning to keep an eye out for is the
updated automatic feature selection warning. Feature selection is automatically applied
when SSAS finds that the number of attributes fed to your model would result in exces-
sive processing overhead without improving the value of the results. It works differently,
depending on which algorithm you are using. Very broadly, feature selection uses one or
more methods of attribute scoring to score each source attribute. Calculated score values
are then used by various ranking algorithms to decide whether or not to include those
attributes in the model’s training data. You can manually tune feature selection by configur-
ing the maximum number of input or output attributes, or by number of states. For more
information about how feature selection is applied when models built on various algorithms
are trained, see the SQL Server Books Online topic “Feature Selection in Data Mining” at
http://msdn.microsoft.com/en-us/library/ms175382.aspx.
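
In BIDS, you set these limits through the algorithm parameters for the model. If you create models
by using DMX instead, the same kind of tuning can be supplied in the USING clause, as in this
sketch (the column list is abbreviated and the parameter values are purely illustrative):

CREATE MINING MODEL [TM Decision Tree Tuned]
(
  [Customer Key]     LONG KEY,
  [Age]              LONG CONTINUOUS,
  [Commute Distance] TEXT DISCRETE,
  [Bike Buyer]       LONG DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
  (MAXIMUM_INPUT_ATTRIBUTES = 50, MAXIMUM_OUTPUT_ATTRIBUTES = 50)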

FIGURE 13-21 Process Progress dialog box

A final consideration is to understand that, as with OLAP objects, you can use included con-
trol flow tasks or data flow components in SSIS to automate the processing of data mining
objects. One of these tasks is the Analysis Services processing task. This task is found in the
Toolbox in SSIS. It allows you to configure any type of processing for any mining model or
structure as part of an SSIS control flow and resulting executable package.

In addition to the task already mentioned, in the data flow section of SSIS, there is an addi-
tional built-in component that you can use to automate data mining object processing. This
component is the data mining model training component. It’s available only in the Enterprise
edition of SQL Server 2008 SSIS. It’s used as a destination, and it requires input columns from
a data source in the SSIS data flow. In this component, on the Columns tab, you configure
the column mapping that will be used to load (or train) the destination data mining model.
This consists of matching the data flow (input) columns to the destination (mining structure)
columns.

In addition to using SSIS to automate data mining object processing, you can use other
methods. These include using SSMS or using XMLA scripting. One final consideration when
you’re moving from development to deployment is security for your data mining structures
and models. Similar to OLAP objects, the simplest way for you to define custom security set-
tings is for you to create and configure roles using either BIDS or SSMS.

Also common between data mining and OLAP is the fact that by default only an administra-
tor can access any of the objects. You might remember from an earlier chapter that SSAS
roles allow you to define permissions at the level of the entire database (which includes all
data sources and all cubes and mining structures) or much more granularly, such as read
permission on a single data source, cube, dimension, or mining structure. We show the min-
ing structure security interface in Figure 13-22. Note that when you examine this interface,
you set the level of access (options are None or Read), whether or not to enable drillthrough,
whether or not to allow the definition to be read, and whether or not to allow processing.

FIGURE 13-22 Security role configuration options for data mining models

Data Mining Clients


There is one final important consideration in the data mining software development life
cycle—that is, considering what the end-user client applications will be. We devote several
chapters in Part IV to this topic. Whatever client or clients you choose for your BI project, it’s
critical that you investigate their support for advanced features (for example, drillthrough) of
SQL Server 2008 data mining before you build such features into your models using BIDS.

In Chapter 24, “Microsoft Office 2007 as a Data Mining Client,” we examine the integration
between Microsoft Office 2007 and SQL Server 2008 data mining. This integration fea-
ture set is currently available in Excel 2007 and Visio 2007. We’ll also provide information
about integration between SSRS and SSAS data mining in Chapter 20, “Creating Reports
in SQL Server 2008 Reporting Services.” We’ll take a quick look at Office SharePoint Server
2007 and PerformancePoint Server data mining integration as well in Chapter 25. And, recall
that in Chapter 12 we looked at other methods of implementing data mining clients, both
purchased and built, including the data mining controls that can be embedded into custom
.NET applications.

Summary
In this chapter, we looked at the data mining structure and model creation wizards in BIDS.
We then reviewed the included capabilities for validating your mining models. These pro-
cesses include lift and profit charts, as well as the classification matrix and the newly intro-
duced cross validation capability. We then explored the Mining Model Prediction tab in BIDS.
This led us to a brief look at DMX itself. We followed that by looking at DMX query and min-
ing model processing integration with SSIS.

This completes the second major section of this book. By now, you should be comfortable
with OLAP cube and data mining concepts and their implementation using BIDS. In the next
chapter, we’ll take a deeper look at the world of ETL using SSIS. Be aware that we’ll include
information about integration with SSRS and other clients with SSAS objects in the section
following the SSIS section.
Part III
Microsoft SQL Server 2008
Integration Services
for Developers

Chapter 14
Architectural Components of
Microsoft SQL Server 2008
Integration Services
In Part II, we looked at the ways in which Microsoft SQL Server 2008 Analysis Services (SSAS)
delivers a set of core functionality to enable analysis of data warehouse data within a busi-
ness intelligence (BI) solution. But how does the data get from the disparate source systems
into the data warehouse? The answer is Microsoft SQL Server 2008 Integration Services (SSIS).
As the primary extract, transform, and load (ETL) tool in the SQL Server toolset, Integration
Services provides the core functionality required to extract data from various source sys-
tems that contain the information for a BI solution, transform the data into forms required
for analysis, and load the data into the data warehouse. These source systems can include
RDBMS systems, flat files, XML files, and any other source to which Integration Services can
connect.

SSIS can do much more than simply perform ETL for a BI solution, but in this part of the book
we’re going to focus primarily on the use of SSIS as an ETL platform for loading the data
warehouse in a BI solution. In this part, we will also cover best practices and common tech-
niques for preparing data for loading into an SSAS data mining model.

This chapter introduces SSIS from an architectural perspective by presenting the significant
architectural components of the SSIS platform: what pieces make up SSIS, how they work
together, and how you as an SSIS developer work with them to build the ETL components for
your BI solution. We will look at the major building blocks of SSIS, including the SSIS runtime,
packages and their components, and the tools and utilities through which you will design,
develop, and deploy ETL solutions. By the end of this chapter you will have the technical
foundation for the content presented in the chapters that follow. You will also have an under-
standing of most of the core capabilities of the SSIS platform. Developers new to SQL Server
2008 sometimes ask us, “Why should I use SSIS rather than more traditional methods of data
management, such as Transact-SQL scripts?” We need more than one chapter to answer this
question completely; however, our goal for this chapter is to give you a preliminary answer,
based mostly on an understanding of SSIS architecture.


Overview of Integration Services Architecture


The Integration Services platform includes many components, but at the highest level, it is
made up of four primary parts:

■■ Integration Services runtime The SSIS runtime provides the core functionality neces-
sary to run SSIS packages, including execution, logging, configuration, debugging, and
more.
■■ Data flow engine The SSIS data flow engine (also known as the pipeline) provides the
core ETL functionality required to move data from source to destination within SSIS
packages, including managing the memory buffers on which the pipeline is built and
the sources, transformations, and destinations that make up a package’s data flow logic.
■■ Integration Services object model The SSIS object model is a managed .NET applica-
tion programming interface (API) that allows tools, utilities, and components to interact
with the SSIS runtime and data flow engine.
■■ Integration Services service The SSIS service is a Windows service that provides func-
tionality for storing and managing SSIS packages.

Note The Integration Services service is an optional component of SSIS solutions. Unlike the
SQL Server service, which is responsible for all access to and management of the data in a SQL
Server database, the SSIS service is not responsible for the execution of SSIS packages. This is
often a surprise to developers new to SSIS, who expect that a service with the same name as the
product will be the heart of that product. This isn’t the case with SSIS. Although the Integration
Services service is useful for managing deployed SSIS solutions and for providing insight into the
status of executing SSIS packages, it is perfectly possible to deploy and execute SSIS packages
without the SSIS service running.

These four key components form the foundation of SSIS, but they are really just the start-
ing point for examining the SSIS architecture. Of course the principal unit of work is the SSIS
package. We’ll explain in this chapter how all of these components work together to allow
you to build, deploy, execute, and log execution details for SSIS packages.

Figure 14-1 illustrates how these four parts relate to each other and how they break down
into their constituent parts. Notice in Figure 14-1 that the metacontainer is an SSIS package.
It is also important to note that the primary component for moving data in an SSIS package
is the Data Flow task.

We’ll spend the rest of the chapter taking a closer look at many of the architectural compo-
nents shown in Figure 14-1. In addition to introducing each major component and discuss-
ing how the components relate to each other, this chapter also discusses some of the design
goals of the SSIS platform and how this architecture realizes these goals. After this introduc-
tory look at SSIS architecture, in subsequent chapters we’ll drill down to examine all of the
components shown in Figure 14-1.

[Figure 14-1 (diagram): custom applications, the SSIS Designer, command-line utilities, and SSIS wizards communicate through the native and managed object models with the Integration Services runtime (packages containing tasks, custom tasks, containers, connection managers, event handlers, enumerators, and log providers) and with the Integration Services data flow (sources, transformations, and destinations, including custom data flow components); the Integration Services service manages package storage in .dtsx files and the msdb database.]

Figure 14-1 The architecture of Integration Services

Note When comparing SSIS to its predecessor, DTS, it’s not uncommon to hear someone
describe SSIS as “the new version of DTS,” referring to the Data Transformation Services function-
ality included in SQL Server 7.0 and SQL Server 2000. Although SSIS and DTS are both ETL tools,
describing SSIS as a new version of DTS does not accurately express the scope of differences
between these two components. In a nutshell, the functionality of DTS was not what Microsoft’s
customers wanted, needed, or expected in an enterprise BI ETL tool.

SSIS was completely redesigned and built from the ground up in SQL Server 2005 and
enhanced in SQL Server 2008. If you work with the SSIS .NET API, you will still see the term
DTS in the object model. This is because the name SSIS was chosen relatively late in the SQL
Server 2005 development cycle—too late to change everything in the product to match the
new marketing name without breaking a great deal of code developed by early adopters—
not because code is left over from DTS in SSIS.

Integration Services Packages


As mentioned earlier, when you develop an ETL solution by using SSIS, you’re going to be
developing packages. SSIS packages are the basic unit of development and deployment in
SSIS and are the primary building blocks of any SSIS ETL solution. As shown in Figure 14-1,
the package is at the heart of SSIS; having SSIS without packages is like having the .NET
Framework without any executables. An SSIS package is, in fact, a collection of XML that
can be run, much like an executable file. Before we examine the core components of an SSIS
package, we’ll take a brief look at some of the tools and utilities available to help you to
develop, deploy, and execute packages.

Tools and Utilities for Developing, Deploying, and Executing Integration Services Packages
SSIS includes a small set of core tools that every SSIS developer needs to use and under-
stand. SSIS itself is an optional component of a SQL Server 2008 installation. Installing the
SSIS components requires an appropriate SQL Server 2008 license. Also, some features of
SSIS require the Enterprise edition of SQL Server 2008. For a detailed feature comparison,
see http://download.microsoft.com/download/2/d/f/2df66c0c-fff2-4f2e-b739-bf4581cee533/
SQLServer%202008CompareEnterpriseStandard.pdf.

This installation includes a variety of tools to work with SSIS packages. Additional tools are
included with SQL Server 2008, as well as quite a few add-ons provided by Microsoft, third
parties, and an active community of open source SSIS developers. However, the tools intro-
duced here—SQL Server Management Studio, Business Intelligence Development Studio, and
the command-line utilities DTEXEC, DTEXECUI, and DTUTIL—are enough to get you started.

SQL Server Management Studio


As you know from earlier chapters, SQL Server Management Studio (SSMS) is used primarily
as a management and query tool for the SQL Server database engine, but it also includes the
ability to manage SSAS, SSRS, and SSIS. As shown in Figure 14-2, the Object Explorer window
in SSMS can connect to an instance of the SSIS service and be used to monitor running pack-
ages or to deploy packages to various package storage locations, such as the msdb database
or the SSIS Package Store.

Figure 14-2 SSMS Object Explorer displays SSIS packages.

It’s important to keep in mind that you cannot use SSMS to develop SSIS packages. Package
development is performed in BIDS, not SSMS. ETL developers who have worked with DTS
in SQL Server 7.0 and SQL Server 2000 often look for package-development tools in SSMS
because SSMS is the primary replacement for SQL Server Enterprise Manager, and Enterprise
Manager was the primary development tool for DTS packages. The advent of SSIS in SQL
Server 2005, however, brings different tools for building and managing packages.

Note The Import/Export Wizard is the one exception to the “you can’t develop SSIS packages
in SSMS” concept. You start this tool by right-clicking any user database in the SSMS Object
Explorer, clicking Tasks, and then clicking either Import Data or Export Data. The Import/Export
Wizard allows you to create simple import or export packages. In this tool, you select the data
source, destination, and method you will use to copy or move the objects to be imported or
exported. You can choose to select the entire object or you can write a query to retrieve a
subset. Objects available for selection include tables or views. You can save the results of this
wizard as an SSIS package and you can subsequently edit this package in the SSIS development
environment.

Business Intelligence Development Studio


The primary tool that SSIS package developers use every day is the Business Intelligence
Development Studio (BIDS). We’ve referred to BIDS in several earlier chapters, and in Part II,
you saw its uses in SQL Server Analysis Services. BIDS also contains the necessary designers
for creating, editing, and debugging Integration Services components, which include con-
trol flows and data flows, event handlers, configurations, and anything else that packages
need. As with SSAS development, SSIS package development using BIDS is accomplished
with wizards and visual tools in the integrated development environment. We will use BIDS
for the majority of our subsequent explanations and examples for SSIS. Just as with SSAS
OLAP cubes and data mining models, as we explore SSIS we’ll use the sample SSIS packages
available for download at http://www.CodePlex.com. These packages revolve around the
AdventureWorksDW scenario that we have already been using.

If you have a full version of Microsoft Visual Studio 2008 installed on your development com-
puter and you subsequently install SSIS, the SSIS templates will be installed to your existing
Visual Studio instance.

Upgrading Integration Packages from an Earlier Version of SQL Server
BIDS includes a wizard called the SSIS Package Upgrade Wizard that will automati-
cally open if you attempt to open an SSIS package designed in SQL Server 2005. This
wizard only launches the first time you attempt to open a SQL Server 2005 package.
Thereafter, BIDS automatically loads the packages and gives you the choice to upgrade
those packages upon processing. This wizard includes a number of options, including
the ability to assign a password, a new GUID, and more. The default options include
updating the connection string, validating the package, creating a new package ID,
continuing with the package upgrade when a package upgrade fails, and backing up
the original package. Like many other SQL Server 2008 wizards, this one includes a final
List Of Changes dialog box, the ability to move back at any time, and a final dialog box
showing conversion steps and the success or failure of each step.

If you have packages created in SQL Server 2000, you can attempt to upgrade them
or you can install the backward-compatible run-time engine for DTS packages and run
those packages using that engine.

DTEXEC and DTEXECUI


After you create packages by using BIDS, they will be executed. Although packages can be
executed and debugged within the BIDS environment, this is generally only the case during
the development phase of a BI project. Executing packages through BIDS can be significantly
slower, largely because of the overhead of updating the graphical designers as the package
executes. Therefore, you should use another utility in production scenarios, and that utility
is the DTEXEC.exe command-line utility. Like many SQL Server command-line tools, DTEXEC
provides a wide variety of options and switches. This capability is very powerful and allows
you to configure advanced execution settings. We’ll cover these capabilities as we continue
through this section of the book. For now, we’ll get started with an example. The following
switches execute a package from the file system, specify an additional XML configuration file,
specify that validation warnings should be treated as errors, and specify that errors and infor-
mation events should be reported by the runtime to any logging providers:

DTEXEC.EXE /FILE "C:\Package.dtsx" /CONFIGFILE "C:\ConfigFile.dtsConfig" /WARNASERROR /REPORTING EI

Fortunately, SSIS developers do not need to memorize all of these command-line options, or
even look them up very often—DTEXECUI.exe, also known as the Execute Package Utility, is
available for this. As its name implies, DTEXECUI is a graphical utility that can execute pack-
ages, but it has additional capabilities as well. As shown in Figure 14-3, DTEXECUI allows
developers and administrators to select execution options through a variety of property
pages and be presented with the DTEXEC command-line switches necessary to achieve the
same results through the DTEXEC command-line utility.

Figure 14-3 DTEXECUI—the Execute Package Utility

Because many SSIS packages are executed as part of a scheduled batch process, these two
utilities complement each other very well. You can use the DTEXECUI utility to generate the
command-line syntax, and then execute it using DTEXEC as part of a batch file or SQL Server
Agent job.

DTUTIL
DTUTIL.exe is another command-line utility; it is used for SSIS package deployment and
management. With DTUTIL, developers and administrators can move or copy packages to
the msdb database, to the SSIS Package Store (which allows you to further group packages
into child folders viewable in SSMS), or to any file system folder. You can also use DTUTIL to
encrypt packages, set package passwords, and more. Unlike DTEXEC, DTUTIL has no cor-
responding graphical utility, unfortunately, but SQL Server Books Online contains a compre-
hensive reference to the available switches and options.

The Integration Services Object Model and Components


Although the tools included with SSIS provide a great deal of out-of-the-box functionality, on
occasion these tools do not provide everything that SSIS developers need. For these situa-
tions, SSIS provides a complete managed API that can be used to create, modify, and execute
packages. The .NET API exposes the SSIS runtime, with classes for manipulating tasks, con-
tainers, and precedence constraints, and the SSIS data flow pipeline, with classes for manipu-
lating sources, transformations, and destinations.

Through the SSIS API, developers can also build their own custom tasks, transformations,
data sources and destinations, connection managers, and log providers. SSIS also includes a
Script task for including custom .NET code in the package’s control flow and a Script compo-
nent for including custom .NET code in the package’s data flow as a source, destination, or
transformation. In addition, new to SQL Server 2008 is the ability to write custom code in C#.
(Previously you were limited to using Microsoft Visual Basic .NET). In Chapter 19, “Extending
and Integrating SQL Server 2008 Integration Services,” we’ll take a closer look at working with
the SSIS API.
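
To make this concrete, here is a minimal C# sketch (our own illustration, not one of the book's samples; the file path is a placeholder) that uses the Microsoft.SqlServer.Dts.Runtime classes to load a package from disk and execute it. It assumes a project reference to the Microsoft.SqlServer.ManagedDTS assembly that ships with SSIS.

using System;
using Microsoft.SqlServer.Dts.Runtime;

class RunPackage
{
    static void Main()
    {
        // Load a package from the file system and execute it in-process.
        Application app = new Application();
        Package package = app.LoadPackage(@"C:\SSIS\SamplePackage.dtsx", null);

        DTSExecResult result = package.Execute();
        Console.WriteLine("Execution result: {0}", result);

        // Validation and run-time problems surface through the Errors collection.
        foreach (DtsError error in package.Errors)
        {
            Console.WriteLine("{0}: {1}", error.SubComponent, error.Description);
        }
    }
}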

Now we’ll consider the physical makeup of an SSIS package. SSIS packages are stored as XML
files, typically with a .dtsx file extension, but because of the flexibility that SSIS provides for
building, executing, and storing packages, it’s not always that straightforward. Despite this,
if you think of a C# application as being made up of .cs files, thinking of SSIS applications as
being made up of .dtsx files isn’t a bad place to start. In the chapters ahead we’ll see that this
isn’t strictly true, but for now it’s more important to start on familiar footing than it is to be
extremely precise. In this chapter we’ll cover some of the most common scenarios around
SSIS packages, and in later chapters we’ll dive into some of the less-used (but no less power-
ful) scenarios.

SSIS packages are made up of the following primary logical components:

■■ Control flow
■■ Data flow
■■ Variables
■■ Expressions
■■ Connection managers
■■ Event handlers and error handling

Control Flow
The control flow is where the execution logic of the package is defined. A package’s control
flow is made up of tasks that perform the actual work of the package, containers that provide
looping and grouping functionality, and precedence constraints that determine the order
in which the tasks are executed. Figure 14-4 shows the control flow from a real-world SSIS
package; each rectangle is an individual task, and the tasks are connected by colored arrows,
which are precedence constraints. Notice that in our example we’ve followed the important
development best practice of using meaningful task names along with detailed task anno-
tations. We’ll go into much greater detail in later chapters, but to get you started here are
some of the included tasks: Bulk Insert, Data Flow, Execute Process, Execute SQL, Script, and
XML. Some new tasks have been introduced in SQL Server 2008, most notably the Data
Profiling task.

Figure 14-4 A sample control flow

You can think of a package’s control flow as being similar to the “main” routine in traditional
procedural code. It defines the entry point into the package’s logic and controls the flow of
that logic between the package’s tasks, not unlike the way the “main” routine controls the
flow of program logic between the program’s subroutines in a traditional Visual Basic or
C# program. Also, you have quite a few options for configuring the precedence constraints
between tasks. These include conditional execution, such as Success, Failure, or Completion,
and execution based on the results of an expression. As with much of the information intro-
duced here, we’ll be spending quite a bit more time demonstrating how this works and
suggesting best practices for the configuration of these constraints and expressions in later
chapters.
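
For the curious, the same structure can be created through the managed API introduced earlier. The following sketch (the task names and the choice of Execute SQL tasks are our own placeholders) adds two tasks to a package and constrains the second to run only when the first succeeds; like the earlier example, it would run inside a console application that references the SSIS managed assemblies.

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();

// Two Execute SQL tasks added to the package's control flow.
Executable truncateStaging = package.Executables.Add("STOCK:SQLTask");
Executable writeAuditRow = package.Executables.Add("STOCK:SQLTask");
((TaskHost)truncateStaging).Name = "Truncate Staging Table";
((TaskHost)writeAuditRow).Name = "Write Audit Row";

// Run "Write Audit Row" only if "Truncate Staging Table" succeeds.
PrecedenceConstraint constraint =
    package.PrecedenceConstraints.Add(truncateStaging, writeAuditRow);
constraint.Value = DTSExecResult.Success;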

As you can probably imagine, some control flows can become quite complex, but this is fre-
quently not necessary. Many real-world packages have very simple control flow logic; it’s the
data flow where things start to get really interesting.

Data Flow
The data flow is where a package’s core ETL functionality is implemented. While each pack-
age has one primary control flow, a package can have zero, one, or many data flows. Each
data flow is made up of data source components that extract data from a source system and
provide that data to the data flow, transformations that manipulate the data provided by the
data source components, and data destination components that load the transformed data
into the destination systems.

It is very important that you understand that SSIS attempts to perform all data transforms
completely in memory, so that they execute very quickly. A common task for SSIS package
designers for BI solutions is to ensure that package data flows are designed in such a way that
transforms can run in the available memory of the server where these packages are executed.

Figure 14-5 shows a portion of the data flow from a real-world SSIS package; each rectangle
is a data source component, transformation component, or data destination component. The
rectangles are connected by colored arrows that represent the data flow path that the data
follows from the source to the destination. On the data flow designer surface you use green
arrows to create a path for good data rows and red arrows to create a path for bad data
rows. Although you can create multiple good or bad output path configurations, you will
receive a warning if you try to configure a data flow transformation or destination before you
attach output rows (whether good or bad rows) to it. The warning reads “This component
has no available input rows. Do you wish to continue editing the available properties of this
component?”

Although a package’s control flow and data flow share many visual characteristics (after all,
they’re both made up of rectangles connected by colored arrows) it is vitally important to
understand the difference between the two. The control flow controls the package’s execu-
tion logic, while the data flow controls the movement and transformation of data.

Figure 14-5 Data flow

Tasks vs. Components


In an Integration Services package, tasks and components are very different. As men-
tioned earlier, tasks are the items that make up a package’s control flow logic, and com-
ponents are the items that make up a package’s data flow logic that is implemented in
a Data Flow task. This makes components something like properties on a task, or the
value members in a collection property on a task.

Why is this distinction so important? Many aspects of SSIS are implemented at the task
level. You can enable and configure logging for each task. You can define event han-
dlers for a task. You can add expressions to the properties on a task. You can enable
and disable tasks. But none of these activities is available for components. And to make
matters worse, quite a few publications will casually refer to components as “tasks,”
confusing the reader and muddying the waters around an already complicated topic.

Variables
Each SSIS package defines a common set of system variables that provide information (typi-
cally read-only information) about the package and its environment. System variables include
information such as PackageName and PackageID, which identify the package itself, and
ExecutionInstanceGUID, which identifies a running instance of the package. The values of these variables can be very useful when you create an audit log for the packages that make up
up your ETL solution. Audit logs are a common business requirement in BI solutions.

In addition to these common system variables, package developers can create any number of
user variables that can then be used by the package’s tasks, containers, and data flow com-
ponents to implement the package’s logic. SSIS package variables play essentially the same
role as do variables in a traditional application: Tasks can assign values to variables; other
tasks can read these values from the variables. In fact, variables are the only mechanism that
SSIS provides through which tasks can share information.
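
As a small illustration, the following C# sketch shows the body of a Script task's Main method reading and updating a hypothetical user variable named User::RowCount (which would need to be listed in the task's ReadWriteVariables property). The code lives inside the ScriptMain class that BIDS generates for the task; the variable name and event text are our own examples.

public void Main()
{
    // Read a user variable that an earlier task populated, update it, and
    // expose the new value to any downstream task that reads the same variable.
    int rowCount = (int)Dts.Variables["User::RowCount"].Value;
    Dts.Variables["User::RowCount"].Value = rowCount + 1;

    // System variables such as System::PackageName are available read-only.
    string packageName = Dts.Variables["System::PackageName"].Value.ToString();

    bool fireAgain = true;
    Dts.Events.FireInformation(0, "Audit",
        "Package " + packageName + " processed " + (rowCount + 1) + " rows",
        string.Empty, 0, ref fireAgain);

    Dts.TaskResult = (int)ScriptResults.Success;
}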

SSIS vs. DTS: Task Isolation


The fact that you can only use variables to share information between tasks in an SSIS
package marks one of the major differences between SSIS and DTS. In DTS, it was pos-
sible (and unfortunately quite common) to modify properties of packages and package
components from code in other package components. This provided a great deal of
flexibility to DTS developers, but also made maintaining DTS packages problematic.
Essentially, everything within a DTS package was defined with a global scope so that
it could be accessed from anywhere within a package, and this resulted in “spaghetti
code” (interspersed program logic rather than cleanly separated units of logic) often
being the norm in DTS applications.

SSIS, on the other hand, was designed from the ground up to be a true enterprise
ETL platform, and one aspect of this design goal was that SSIS solutions needed to be
maintainable and supportable. By preventing tasks from making arbitrary changes to
other tasks, the SSIS platform helps prevent common errors—such as erroneously alter-
ing the value of a variable—and simplifies package troubleshooting and maintenance.

Figure 14-6 shows the Variables window from an SSIS package in BIDS. To open this window,
choose Other Windows on the View menu, and then select Variables. Click the third icon on
the toolbar to show the system variables. User variables have a blue icon; system variables
(which are hidden by default) are shown with a gray icon.

Figure 14-6 The Variables window



Just as in traditional programming environments, each variable in SSIS has a well-defined scope and cannot be accessed from outside that scope. For example, a variable defined at
the package scope can be accessed from any task or transformation within the package
(assuming that the task or transformation in question knows how to access variables). But a
variable defined within the scope of a Data Flow task can only be accessed from transforma-
tions within that data flow; it cannot be accessed from any other tasks in the package or from
transformations in any other data flows in the package. This results in better quality executa-
bles that are easier to read, maintain, and understand. It is also important to understand that
when a variable is created, its scope is defined and that scope cannot be changed. To change
the scope of a variable, you have to drop and re-create it.

Expressions
One of the most powerful but often least understood capabilities provided by the SSIS run-
time is support for expressions. According to SQL Server Books Online, “expressions are a
combination of symbols (identifiers, literals, functions, and operators) that yields a single data
value.” This is really an understatement. As we’ll see, expressions in SSIS provide functional-
ity that is equally difficult to quantify until you see them in action, and the sometimes dry
content in SQL Server Books Online doesn’t really do much to explain just how powerful
they are. In DTS some of the functionality available in SSIS expressions was called Dynamic
Properties, but in SQL Server 2005 and later, SSIS expressions significantly increased the
capabilities of associating properties with expressions.

SSIS expressions provide a powerful mechanism for adding custom declarative logic to
your packages. Do you need to have the value of a variable change as the package exe-
cutes, so that it reflects the current state of the running package? Just set the variable’s
EvaluateAsExpression property to true and set its Expression property to an SSIS expression
that yields the necessary value. Then, whenever the variable is accessed, the expression is
evaluated and its value is returned. If you don’t find this exciting, consider this: The same
thing that you can do with variables you can also do with just about any property on just
about any task in your package. Expressions allow you to add your own custom logic to pre-
built components, extending and customizing their behavior to suit your needs. To draw an
analogy to .NET development, this is like being able to define your own implementation for
the properties of classes that another developer built, without going to the trouble of writing
a child class to do it.
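
As a quick sketch of that idea through the object model (the variable name and the expression itself are our own examples), the following code creates a string variable whose value is recomputed from an SSIS expression every time the variable is read:

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();

// A user variable driven by an SSIS expression rather than a fixed value.
Variable extractFile = package.Variables.Add("ExtractFileName", false, "User", "");
extractFile.EvaluateAsExpression = true;
extractFile.Expression =
    @"""Sales_"" + (DT_WSTR, 4) YEAR(GETDATE()) + "".csv""";

// Each read re-evaluates the expression, for example "Sales_2009.csv".
System.Console.WriteLine(extractFile.Value);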

Figure 14-7 shows the built-in Expression Builder dialog box. We’ll use this tool in later chap-
ters as we review the process and techniques we commonly use to associate custom expres-
sions with aspects of SSIS packages.

We’ll go into more depth on expressions and how to use them in the chapters ahead, but
they are far too important (and far too cool) to not mention as early as possible.

Figure 14-7 The Expression Builder dialog box

Connection Managers
The next part of a package that we’re going to look at in this section is the connection man-
ager. A connection manager is a logical representation of a connection. It acts as a wrap-
per around a physical database connection (or a file connection, an FTP connection, and so
on—it all depends on the type of connection manager) and manages access to that connec-
tion both at design time when the package is being built and at run time when the package
executes.

Another way to look at connection managers is as the gateway through which tasks and
components access resources that are external to the package. For example, whenever an
Execute SQL task needs to execute a stored procedure, it must use a connection manager
to connect to the target database. Whenever an FTP task needs to download data files from
an FTP server, it must use a connection manager to connect to the FTP server and another
connection manager to connect to the file location to which the files will be copied. In short,
whenever any task or component within an SSIS package needs to access resources outside
the package, it does so through a connection manager. Because of this, connection manag-
ers are available practically everywhere within the SSIS design tools in BIDS. The Connection
Managers window, as shown in Figure 14-8, is included at the bottom of the control flow
designer, at the bottom of the data flow designer, and at the bottom of the event handler
designer.

Figure 14-8 Connection managers

The Connection Manager Exception


Using a connection manager for each external access is a good rule of thumb to keep
in mind, but this rule has several exceptions. One example is when you use the Raw File
source and destination data flow components. These are used to read from and write
to an SSIS native raw file format for data staging. Each of these components has an
Access Mode property through which the raw file can be identified by file name directly
or by reading a file name variable, instead of using a connection manager to reference
the file. Several such exceptions exist, but for the most part the rule holds true: If you
want to access an external resource, you’re going to do it through a connection man-
ager. You can see in Figure 14-9 that you have a broad variety of connection types to
select from.

Figure 14-9 Connection manager types



Although the gateway aspect of connection managers may sound limiting, this design offers
many advantages. First and foremost, it provides an easy mechanism for reuse. It is very
common to have multiple tasks and data flow components within a package all reference the
same SQL Server database. Without connection managers, each component would need to
use its own copy of the connection string, making package maintenance and troubleshoot-
ing much trickier. Connection managers also simplify deployment: Because anything that is
location-dependent (and may therefore need to be updated when deploying the package
into a different environment) in a package is managed by a connection manager, it is simple
to identify what needs to be updated when moving a package from development to test to
production environments. It is interesting to note that the Cache connection manager is new
to SQL Server 2008. We’ll cover why it was added and how you use it in later chapters.
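
To illustrate the reuse point, the following sketch adds one OLE DB connection manager that every task and data flow component in the package can reference by name, so the connection string lives in exactly one place. The provider, server, and database names shown are placeholders, not values from the book's samples.

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();

// One connection manager, referenced by name everywhere the package needs
// this database, so the connection string is defined exactly once.
ConnectionManager warehouse = package.Connections.Add("OLEDB");
warehouse.Name = "AdventureWorksDW";
warehouse.ConnectionString =
    "Provider=SQLNCLI10.1;Data Source=localhost;" +
    "Initial Catalog=AdventureWorksDW2008;Integrated Security=SSPI;";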

Tip You can download additional connection managers from utility locations (such as CodePlex),
purchase them, or develop them by programming directly against the SSIS API.

Event Handlers and Error Handling


If a package’s control flow is like the “main” routine in traditional procedural code, a pack-
age’s event handlers are similar to the event procedures written as part of a traditional appli-
cation to respond to user interaction, such as a button-click event. Similar to the components
of a control flow, an SSIS event handler is made up of tasks, containers, and precedence
constraints, but while each package has only one main control flow, a package can have
many event handlers. Each event handler is attached to a specific event (such as OnError or
OnPreExecute) for a specific task or container or for the package itself. In Figure 14-10 you
can see a list of available event handlers for SSIS. The most commonly used event handler
is the first one, OnError. We will drill down in subsequent chapters, examining business sce-
narios related to BI projects where we use other event handlers.

Figure 14-10 SSIS event handlers



As you can see in Figure 14-11, the UI for defining and managing event handlers in an SSIS
package is very similar to the tools in Visual Studio for managing event handlers in Visual
Basic or C# code. The biggest difference (other than the fact that the event handler has a
control flow designer surface for tasks and precedence constraints instead of code) is the
Executable drop-down list. Instead of selecting from a simple list of objects, in SSIS you select
from a tree of the available containers and tasks, reflecting the hierarchical nature of the
executable components in an SSIS package.

Figure 14-11 Event handlers
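
The same hierarchy can be built in code. The sketch below (the handler contents and task name are our own example) attaches an OnError event handler to a package and places a single Execute SQL task inside it:

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();

// The OnError handler has its own executables collection, just like the
// package's main control flow.
DtsEventHandler onError = package.EventHandlers.Add("OnError");

Executable logError = onError.Executables.Add("STOCK:SQLTask");
((TaskHost)logError).Name = "Log Error To Audit Table";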

In addition to defining event handlers for various conditions, including the OnError event
in the control flow, SSIS provides you with an easy-to-use interface to configure the desired
behavior for errors that occur in the data flow. You can specify error actions in sources, trans-
formations, and destinations in the data flow. Figure 14-12 shows an example of configuring
the error output for an OLE DB connection. Notice that you are presented with three action
options at the column error and truncation level: Fail Component, Ignore Failure, or Redirect
Row. If you select the last option, you must configure an error row output destination as part
of the data flow.

Figure 14-12 Error output for an OLE DB data source

The easy configuration of multiple error handling scenarios in SSIS packages is an extremely
important feature for BI solutions because of the large (even huge) volumes of data that are
common to work with for both OLAP cubes and data mining structures.

The Integration Services Runtime


The SSIS runtime provides the necessary execution context and services for SSIS packages.
Similar to how the .NET common language runtime (CLR) provides an execution environment
for ASP.NET and Windows Forms applications—with data types, exception handling, and core
platform functionality—the SSIS runtime provides the execution context for SSIS packages.
When an SSIS package is executed, it is loaded into the SSIS runtime, which manages the exe-
cution of the tasks, containers, and event handlers that implement the package’s logic. The
runtime handles things such as logging, debugging and breakpoints, package configurations,
connections, and transaction support.

For the most part, SSIS package developers don’t really need to think much about the SSIS
runtime. It just works, and if you’re building your packages in BIDS, you can often take it for
granted. The only time that SSIS developers generally need to interact directly with the run-
time is when using the SSIS .NET API and the classes in the Microsoft.SqlServer.Dts.Runtime
namespace, but understanding what’s going on under the hood is always important if you’re
going to get the most out of SSIS (or any development platform).

The Integration Services Data Flow Engine


While the SSIS runtime is responsible for managing the execution of packages’ control flow,
the heart of the SSIS data flow is the data flow engine. The data flow engine is responsible for
interacting with data source components to extract data from flat files, relational databases,
and other sources; for managing the transformations that manipulate the data that flows
through the pipeline; and for interacting with the data destination components that load
data into destination databases or other locations. But even more important than this, the
data flow engine is responsible for managing the memory buffers that underlie the work that
the source, destination, and transformation components are doing.

Note Why is the data flow engine also known as the pipeline? Early in the SSIS development
cycle the term pipeline was used for what became known as the Data Flow task; this was due to
its logical pipeline design. These days everyone knows it as the Data Flow task, but the name
lives on in the Microsoft.SqlServer.Dts.Pipeline.Wrapper namespace, through which developers
can interact with the core SSIS data flow functionality.

In the previous section we introduced the SSIS runtime, briefly explained its purpose and
function, and said that most SSIS developers could take it for granted, and then we moved
on. We’re not going to do the same thing with the data flow engine. Although the data flow
engine is also largely hidden from view, and although many SSIS developers do build pack-
ages without giving it much thought, the data flow engine is far too important to not go into
more detail early on. Why is this?

It is due largely to the type of work that most SSIS packages do in the real world. Take a look
at the control flow shown earlier in Figure 14-4. As you can see, this package performs the
following basic tasks:

■■ Selects a row count from a table
■■ Inserts a record into another table and captures the identity value for the new row
■■ Moves millions of records from one database to another database
■■ Selects a row count from a table
■■ Updates a record in a table

This pattern is very common for many real-world SSIS packages: A handful of cleanup tasks
perform logging, notification, and similar tasks, with a Data Flow task that performs the core
ETL functionality. Where do you think the majority of the execution time—and by extension,
the greatest opportunity for performance tuning and optimization—lies in this package?

The vast majority of the execution time for most SSIS packages in real-world BI projects is
spent in the Data Flow task, moving and transforming large volumes of data. And because of
this, the proper design of the Data Flow task generally matters much more than the rest of
the package, assuming that the control flow correctly implements the logic needed to meet
the package’s requirements.

Therefore, it is important that every SSIS developer understands what’s going on in the data
flow engine behind the Data Flow task; it’s very difficult to build a data flow that performs
well for large volumes of data without a solid understanding of the underlying architecture.
Let’s take a closer look at the data flow engine by looking at some of its core components,
namely buffers and metadata, and at how different types of components work with these
components.

Data Flow Buffers


Although the Data Flow task operates as a logical pipeline for the data, the underlying imple-
mentation actually uses memory buffers to store the data being extracted from the data
flow’s sources (transformed as necessary) and loaded into the data flow’s destinations. Each
data source component in a data flow has a set of memory buffers created and managed by
the data flow engine; these buffers are used to store the records being extracted from the
source and manipulated by transformations before being loaded into the destination.

Ideally, each buffer will be stored in RAM, but in the event of a low memory condition, the
data flow engine will spool buffers to disk if necessary. Because of the high performance
overhead of writing buffers to disk (and the resulting poor performance), avoid this whenever
possible. We’ll look at ways to control data flow buffers in a later chapter as part of our cov-
erage of performance tuning and optimization.

In SQL Server 2008, the data flow pipeline has been optimized for scalability, enabling
SSIS to more efficiently utilize multiple processors (specifically, more than two CPUs) avail-
able in the hardware environment where SSIS is running. This works by implementing more
efficient default thread scheduling. This improved efficiency allows you to spend less time
performance tuning the package (by changing buffer size settings, for example). Automatic
parallelism and elimination of thread starvation and deadlocks results in faster package
execution.

Data Flow Metadata


The properties of the buffers described earlier depend on a number of factors, the most
important of which is the metadata of the records that will pass through the data flow. The
metadata is generally determined by the source query used to extract data from the source
system. (We’ll see exceptions to this in the next few sections, but we’ll look at the most com-
mon cases first.) For example, consider a data flow that is based on the following query
against the AdventureWorks2008 database. The data type of each column in the source table
is included for reference only.
SELECT [BusinessEntityID] -- INT
,[NationalIDNumber] -- NVARCHAR (15)
,[LoginID] -- NVARCHAR (256)
,[JobTitle] -- NVARCHAR (50)
,[HireDate] -- DATETIME
,[rowguid] -- UNIQUEIDENTIFIER
,[ModifiedDate] -- DATETIME
FROM [HumanResources].[Employee]
WHERE [ModifiedDate] > '2008-02-27 12:00:00.000'

Each row in the data flow buffer created to hold this data will be 678 bytes wide: 4 bytes for
the INT column, 8 bytes for each DATETIME column, 16 bytes for the UNIQUEIDENTIFIER, and
2 bytes for each character in each NVARCHAR column. Each column in the buffer is strongly
typed so that the data flow engine knows not only the size of the column, but also what data
can be validly stored within. This combination of columns, data types, and sizes defines a
specific type for each buffer, not unlike how a combination of columns, data types, and sizes
defines a specific table TYPE in the SQL Server 2008 database engine.
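
Because every column is allocated at its full declared width, the row size is fixed arithmetic; the following quick C# check reproduces the figure quoted above:

// 678 bytes per row: each column contributes its full declared width.
int rowWidth =
      4          // BusinessEntityID   INT
    + 2 * 15     // NationalIDNumber   NVARCHAR(15), 2 bytes per character
    + 2 * 256    // LoginID            NVARCHAR(256)
    + 2 * 50     // JobTitle           NVARCHAR(50)
    + 8          // HireDate           DATETIME
    + 16         // rowguid            UNIQUEIDENTIFIER
    + 8;         // ModifiedDate       DATETIME

System.Console.WriteLine(rowWidth);  // 678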

The data flow designer in BIDS gives SSIS developers a simple way to view this metadata.
Simply right-click any data flow path arrow and select Edit from the shortcut menu. In the
Path Metadata pane of the Data Flow Path Editor you can see the names, types, and sizes of
each column in the buffer underlying that part of the data flow. Figure 14-13 shows the Data
Flow Path Editor for the preceding sample query.

Figure 14-13 Data flow metadata



It’s worth noting, however, that the metadata displayed in this dialog box is not always the
complete metadata for the underlying buffer. Consider the data flow shown in Figure 14-14.

Figure 14-14 Adding a Derived Column transformation

In this example, a Derived Column transformation has been added to the data flow. This
transformation adds a new column, AuditKey, to the data flow. This is a common technique
for BI projects, because knowing where the data came from is often almost as important as
the data itself. If we were to examine the metadata for the data flow path between the OLE
DB Source component and the Derived Column transformation after making this change,
it would be identical to what is shown in Figure 14-13. At first glance this makes sense. The
AuditKey column is not added until the Derived Column transformation is reached, right?

Wrong. This is where the difference between the logical pipeline with which SSIS developers
interact and the physical buffers that exist under the covers becomes evident. Because the
same buffer is used for the entire data flow (at least in this example—we’ll see soon how this
is not always the case), the additional 4 bytes are allocated in the buffer when it is created,
even though the column does not exist in the source query. The designer in BIDS is intelligent
enough to hide any columns in the buffer that are not in scope for the selected data flow
path, but the memory is set aside in each row in the buffer. Remember that the bulk of the
performance overhead in BI projects occurs in the data flow, and by association proper use of
the buffer is very important if you want to create packages that perform well at production
scale.

Variable Width Columns


It’s worth noting that the SSIS data flow has no concept of variable width columns. Therefore,
even though the columns in the preceding example are defined using the NVARCHAR data
type in the source system and are thus inherently variable length strings, the full maximum
width is always allocated in the data flow buffer regardless of the actual length of the value in
these columns for a given row.

Why is this? Why does SSIS pass up what appears to be such an obvious opportunity for opti-
mization, when the SQL Server database engine has done so for years?

The answer can be found in the different problems that SQL Server and SSIS are designed
to solve. The SQL Server database stores large volumes of data for long periods of time, and
needs to optimize that storage to reduce the physical I/O involved with reading and writing
the data. The more records that can be read or written in a single I/O operation, the better
SQL Server can perform.

The SSIS data flow, on the other hand, is designed not to store data but to perform trans-
formations on the data in memory. To do this as efficiently as possible, the data flow engine
needs to know exactly where to find each field in each record. To do that as efficiently as
possible, it can’t worry about tracking the length of each field in each record. Instead, by
allocating the full size for each column, the SSIS data flow engine can use simple pointer
arithmetic (which is incredibly fast) to address each record. In the preceding example, if the
first record in the buffer is located at memory address 0x10000, the second record is located
at 0x102A6, the third is located at 0x1054C, and the LoginID column of the fourth record is
located at 0x10814. Because each column width is fixed, this is all perfectly predictable, and it
performs like a dream.
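
The arithmetic behind those addresses is easy to reproduce. This is only a sketch; the actual buffer addresses are managed internally by the data flow engine:

const int rowWidth = 678;               // from the earlier calculation
const int loginIdOffset = 4 + 2 * 15;   // bytes preceding the LoginID column

long bufferStart = 0x10000;
long secondRecord = bufferStart + 1 * rowWidth;                          // 0x102A6
long thirdRecord = bufferStart + 2 * rowWidth;                           // 0x1054C
long loginIdOfFourthRecord = bufferStart + 3 * rowWidth + loginIdOffset; // 0x10814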

All of this strict typing in the buffer has both pros and cons. On the plus side, because SSIS
knows exactly what data is stored where at all times it can transform data very quickly, yield-
ing excellent performance for high-volume data flows. This is the driving reason that SSIS is
so picky about data types: The SSIS team knew that for SSIS to be a truly enterprise-class ETL
platform, it would have to perform well in any scenario. Therefore, they optimized for perfor-
mance wherever possible.

The negative side of the strict data typing in the SSIS data flow is the limitations that it places
on the SSIS developer. Implicit type conversions are not supported; not even narrowing con-
versions can be performed implicitly. This is a common cause for complaint for developers
who are new to SSIS. When you’re converting from a 4-byte integer to an 8-byte integer,
there is no potential for data loss, so why can’t SSIS “just do it?” You can do this in C#—why
not in SSIS?

The answer lies in the problems that each of these tools was designed to solve. C# is a
general-purpose programming language where developer productivity is a key requirement.
SSIS is a specialized tool where it is not uncommon to process tens of millions of records at
once. The run-time overhead involved with implicit data type conversion is trivial when deal-
ing with a few values, but it doesn’t scale particularly well, and SSIS needs to scale to handle
any data volume.

How Integration Services Uses Metadata


Many of the most commonly asked questions about working with the SSIS data flow
task arise from a lack of understanding of how SSIS uses metadata. This is especially the
case for developers with a DTS background. DTS was much more flexible and forgiving
around metadata, which often made it slower than SSIS but also made it easier to work
with data from less-structured data sources such as Microsoft Office Excel workbooks.
SSIS needs complete, consistent, stable metadata for everything in the data flow, and
can often be unforgiving of changes to data sources and destinations because its meta-
data is no longer in sync with the external data stores. The next time you find yourself
asking why a data flow that was working yesterday is giving you warnings and errors
today, the first place you should generally look is at the metadata.

In the chapters ahead, we will revisit the topic of data flow metadata many times—it is the
foundation of many of the most important tasks that SSIS developers perform. If it is impor-
tant for .NET developers to understand the concepts of value types and reference types in
the CLR, or for SQL Server developers to understand how clustered and nonclustered indexes
affect data access performance, it is equally important for SSIS developers to understand the
concepts of metadata in the SSIS data flow. SQL Server Books Online has additional coverage
of SSIS metadata—look for the topic “Data Flow Elements.”

So far we’ve been assuming that all of the buffers for a given data flow have the same meta-
data (and therefore are of the same type) and that the same set of buffers will be shared by
all components in the data flow. Although this is sometimes the case, it is not always true. To
understand the details, we need to take a look at the two different types of outputs that SSIS
data flow components can have: synchronous and asynchronous.

Synchronous Data Flow Outputs


A data flow component with synchronous outputs is one that outputs records in the same
buffer that supplied its input data. Consider again the data flow shown in Figures 14-4 and
14-5. You can see that the Derived Column transformation is adding a new AuditKey column
to the data flow. Because the Derived Column transformation has one synchronous output,
the new column value is added to each row in the existing data flow buffer.

The primary advantage to this approach is that it is incredibly fast. No additional memory is
being allocated and most of the work being performed can be completed via pointer arith-
metic, which is a very low-cost operation. Most of the workhorse transformations that are
used to build the ETL components of real-world BI applications have synchronous outputs,
and it is common to design data flows to use these components wherever possible because
of the superior performance that they offer.

Asynchronous Data Flow Outputs


The disadvantage to synchronous outputs is that some operations cannot be performed on
existing records in an existing memory buffer. For these operations, the SSIS data flow also
supports asynchronous outputs, which populate different data flow buffers than the buffers
that provide the input records for the data flow component. For example, consider a data
flow that uses an Aggregate transformation and performs the same type of operations as the
SQL GROUP BY clause.

This Aggregate transformation accepts records from an input buffer that contains detailed
sales information, with one input record for each sales order detail line. The output buf-
fer contains not only different columns with different data types, but also different records.
Instead of producing one record per sales order detail line, this transformation produces one
record per sales order.

Because data flow components with asynchronous outputs populate different memory buf-
fers, they can perform operations that components with synchronous outputs cannot, but
this additional functionality comes with a price: performance. Whenever a data flow compo-
nent writes records to an asynchronous output, the memory must be physically copied into
the new buffer. Compared to the pointer arithmetic used for synchronous outputs, memory
copying is a much costlier operation.

Note You might see the terms synchronous and asynchronous used to describe data flow trans-
formations and not their outputs. A single data flow component (and not just transformations)
can have multiple outputs, and each asynchronous output is going to have its own buffers cre-
ated by the data flow engine. Many publications (including parts of the SSIS product documenta-
tion) regularly refer to synchronous and asynchronous transformations because that is the most
common level of abstraction used when discussing the SSIS data flow. You only need to be con-
cerned about this level of detail when you’re building a custom data flow component through
code, or when you’re really digging into the internals of a data flow for performance tuning and
optimization.

Log Providers
Log providers are SSIS components that, as their name implies, log information about pack-
age execution. Logging is covered in more depth in a later chapter; for now it’s sufficient to
understand that each log provider type is responsible for writing this information to a differ-
ent destination. SSIS includes log providers for logging to the Windows Event Log, text files,
XML files, SQL Server, and more. In addition, .NET developers can create their own custom
log providers by inheriting from the Microsoft.SqlServer.Dts.Runtime.LogProviderBase base
class. Logging can be enabled very granularly—at the level of a specific task and interesting
events for that task—rather than globally. In this way logging is both efficient and effective.
Figure 14-15 shows a list of some of the events that can be logged from the configuration
dialog box.

Figure 14-15 SSIS logging events
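
Logging can also be configured through the object model. The sketch below is our own example based on the managed API: it attaches the text file log provider that ships with SSIS 2008 (the ProgID and file path shown are assumptions to verify against your installation) and logs only OnError and OnWarning events.

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();

// A file connection manager tells the text file log provider where to write.
ConnectionManager logFile = package.Connections.Add("FILE");
logFile.Name = "SsisLog";
logFile.ConnectionString = @"C:\Logs\PackageLog.txt";

LogProvider textProvider = package.LogProviders.Add("DTS.LogProviderTextFile.2");
textProvider.ConfigString = logFile.Name;

// Enable logging and restrict it to the events we care about.
package.LoggingMode = DTSLoggingMode.Enabled;
package.LoggingOptions.SelectedLogProviders.Add(textProvider);
package.LoggingOptions.EventFilterKind = DTSEventFilterKind.Inclusion;
package.LoggingOptions.EventFilter = new string[] { "OnError", "OnWarning" };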

You’ve probably noticed a trend by now: Practically any type of component that makes up
the SSIS architecture can be developed by .NET developers using the SSIS managed API. This
is one of the core aspects of SSIS that makes it a true development platform—in addition
to all of the functionality that is provided out of the box, if you need something that is not
included, you can build it yourself. Although SSIS development deserves a complete book of
its own, we will take a look at it in greater depth in Chapter 19.

Deploying Integration Services Packages


SSIS provides many different features and options related to package deployment—so
many, in fact, that Chapter 16, “Advanced Features in Microsoft SQL Server 2008 Integration
Services,” is devoted almost entirely to deployment. One of the most important things to
keep in mind is that a smooth deployment begins not when the package development is
complete, but when package development is beginning, through the proper use of package
configurations.

Package Configurations
Package configurations are an SSIS feature that allows you to store the values for properties
of package tasks outside of the package itself. Although the details are different, configura-
tions are conceptually very similar to using a web.config file in an ASP.NET application: The
values that need to change as the application moves from development to test to production
(usually database connection strings and file/folder paths) are stored externally to the pack-
age code so that they can be updated separately without requiring the application logic to
be modified.

Package configurations are implemented by the SSIS runtime, which reads the configuration
values when the package is being loaded for execution and applies the values from the con-
figuration to the task properties that have been configured. SSIS lets you select from a variety
of different configuration types, including XML files, SQL Server databases, the Windows reg-
istry, and Windows environment variables. SSIS also supports indirect configurations, where
you store the path to an XML file (or the registry path or SQL Server connection string) in a
Windows environment variable and then reference the environment variable from the SSIS
package being configured.

The primary goal of package configurations is to make the packages location-independent.


To have a smooth deployment, any reference that an SSIS package makes to any external
resource should be included in a package configuration. That way, when either the package
or the external resource is moved, only the configuration information must be updated; the
package need not be changed.
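
Configurations are normally enabled through the Package Configurations Organizer in BIDS, but the same settings are exposed on the package object. Here is a minimal sketch (the configuration name and file path are placeholders):

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();

// Point the package at an external XML configuration file; the SSIS runtime
// applies the stored property values when the package is loaded for execution.
package.EnableConfigurations = true;

Configuration config = package.Configurations.Add();
config.Name = "EnvironmentSettings";
config.ConfigurationType = DTSConfigurationType.ConfigFile;
config.ConfigurationString = @"C:\Config\Environment.dtsConfig";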

Package Deployment Options


Just as SSIS provides a variety of location options for storing package configuration informa-
tion, it also provides a variety of location options for package deployment. SSIS packages
can be deployed either to the file system or to a SQL Server instance. As mentioned earlier,
SSIS includes utilities to simplify deployment, including DTEXEC.exe, DTEXECUI.exe, and
DTUTIL.exe.

If packages are deployed to a SQL Server instance, the packages are stored in the dbo.sysssispackages table in the msdb system database. This allows packages to be backed up and
restored with the system databases, which is often desirable if the database administrators
responsible for maintaining the SQL Server instances used by the SSIS application are also
responsible for maintaining the SSIS application itself.

If packages are deployed to the file system, they can be deployed to any location as DTSX
files, or they can be deployed to a default folder that the SSIS service monitors. (This folder
is called the SSIS Package Store and is located by default at C:\Program Files\Microsoft SQL
Server\100\DTS\Packages.) If packages are deployed to this default folder, the packages can
be monitored using SSMS, as described in the previous section on SSIS tools and utilities.
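Both of these deployment targets can also be reached programmatically through the Application class. The sketch below is a simplified illustration: the server name and paths are hypothetical, and passing null for the user name and password uses Windows authentication.

using Microsoft.SqlServer.Dts.Runtime;

class DeployPackage
{
    static void Main()
    {
        Application app = new Application();
        Package pkg = app.LoadPackage(@"C:\ETL\LoadDimCustomer.dtsx", null);

        // Deploy to the msdb database of a SQL Server instance.
        app.SaveToSqlServer(pkg, null, "localhost", null, null);

        // Or deploy to the file system (here, the default SSIS Package Store folder).
        app.SaveToXml(
            @"C:\Program Files\Microsoft SQL Server\100\DTS\Packages\LoadDimCustomer.dtsx",
            pkg, null);
    }
}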

Keep in mind that although you have multiple deployment options, no single option is best.
The deployment options that work best for one project may not be best for another—it all
depends on the context of each specific project.

Summary
In this chapter we took an architectural tour of the SSIS platform. We looked at packages and
the SSIS runtime, which is responsible for package execution. We examined control flow and
data flow—the two primary functional areas within each package—and the components that
SSIS developers use to implement the specific logic needed in their ETL projects. We looked
at the data flow pipeline that delivers the core high-performance ETL functionality that
makes SSIS ideally suited to enterprise-scale data-warehousing projects. We looked at the
tools and utilities included with the SSIS platform for developing, deploying, and managing
SSIS applications, and we looked at the .NET object model that enables developers to extend
and automate practically any aspect of SSIS. And we’ve just scraped the surface of what SSIS
can do.

We are sometimes asked whether SSIS is required for all BI projects. The technical answer
is no. However, we have used these built-in ETL tools for every production BI solution that
we’ve designed, developed, and deployed—ever since the first release of SSIS in SQL Server
2005. Although you could argue that BI ETL can be accomplished by simply writ-
ing data access code in a particular .NET language or by writing database queries against
RDBMS source systems, we strongly believe in using SSIS as the primary ETL workhorse for BI
solutions because of its graphical designer surfaces, self-documenting visual output, sophis-
ticated error handling, and programmatic flexibility. Another consideration is the features
that you’ll need for your solution. Do be aware of the significant feature differences in SSIS
between the Enterprise and Standard editions. Note also that the only ETL tool
available in the Workgroup and Express editions is the Import/Export Wizard. No SSIS fea-
tures are included in those editions.

In the chapters ahead we’ll look at each of the topics introduced in this chapter in greater
depth, focusing both on the capabilities of the SSIS platform and best-practice techniques
for using it to develop and deliver real-world ETL solutions. In the next chapter we’ll focus on
using BIDS to create SSIS packages, so get ready to get your hands dirty!
Chapter 15
Creating Microsoft SQL Server 2008 Integration Services Packages with Business Intelligence Development Studio
In Chapter 14, “Architectural Components of Microsoft SQL Server 2008 Integration Services,”
we looked at the major components that make up the Microsoft SQL Server 2008 Integration
Services (SSIS) platform, including packages, control flow, data flow, and more. Now we’re
going to work with the primary development tool used to create Integration Services pack-
ages: Business Intelligence Development Studio (BIDS). As an Integration Services developer,
you’ll spend the majority of your project time working in BIDS, so you need to understand
how to use the tools that BIDS provides to work with packages.

In this chapter, we examine the SSIS tools that BIDS adds to Microsoft Visual Studio, and how
each one is used when developing SSIS packages. We’ll also look at the workhorse compo-
nents that SSIS developers use when building a business intelligence (BI) application. We
won’t include a laundry list of tasks and transformations—SQL Server Books Online does an
excellent job of this. Instead, we’ll focus on the most commonly used components and how
they can best be used when developing real-world solutions with SSIS.

Integration Services in Visual Studio 2008


As you know by now, BIDS is the Visual Studio 2008 integrated development environment,
which includes new project types, tools, and windows to enable SQL Server BI development.
BIDS provides the core tools that SSIS developers use to develop the packages that make up
the extract, transform, and load (ETL) component of a BI application. But BIDS also represents
something more—it represents a major shift from how packages were developed using Data
Transformation Services (DTS) in the days of SQL Server 2000 and SQL Server 7.0.

One of the biggest differences between DTS and SSIS is the tools that package developers
use to build their ETL processes. In DTS, developers worked in SQL Server Enterprise Manager
to build DTS packages. Enterprise Manager is primarily a database administrator (DBA) tool,
so DBAs who needed to build DTS packages generally felt right at home. And because it was
often DBAs who were building the DTS packages, this was a good thing.


But one major drawback of using Enterprise Manager for DTS package development is that
Enterprise Manager isn’t really a development tool—it’s a management tool. Because of this,
it was difficult to treat DTS packages as first-class members of an application. There was no
built-in support for source control, versioning, or deployment. Few of the tools and proce-
dures that were used to store, version, and deploy the other parts of the BI application (such
as source code, reports, SQL scripts, and so on) could be easily used with DTS packages, and
this was due in large part to the reliance of DTS on Enterprise Manager.

With SSIS, package development takes place in Visual Studio. This means that it’s now much
easier to integrate SSIS into the overall software development life cycle, because you can now
use the same tools and processes for your SSIS packages that you use for other project arti-
facts. This is also a good thing, but it too has its drawbacks. One drawback is that many DBAs
who are familiar with Enterprise Manager or SSMS might not be as familiar with Visual Studio.
So to take advantage of the capabilities of the SSIS platform, these DBAs need to also learn a
new—and often complex—set of tools.

If you’re a software developer who has spent a lot of time building software in Visual Studio,
the information in the next few sections might seem familiar to you, but please keep read-
ing. Although we’re going to talk about some familiar concepts such as solutions and projects
and some familiar tools such as Solution Explorer, we’re going to focus as much as possible
on the details that are specific to SSIS. The remainder of this section is an overview of how
Visual Studio is used to develop SSIS packages, including a tour of the project template and
the various windows and menus and tools in Visual Studio that are unique to SSIS.

Creating New SSIS Projects with the Integration Services Project Template
The first step when building an ETL solution with SSIS is to create a new SSIS project in BIDS.
From the main menu in BIDS, select File, New, and then Project. You’ll be presented with the
New Project dialog box shown in Figure 15-1. This screen shot represents BIDS without an
existing full Visual Studio 2008 installation. When you are working only with BIDS, the only
project types available are Business Intelligence Projects and Other Project Types. If you have
a full installation of Visual Studio 2008, you would also have C# and Visual Basic .NET project
types.

In the New Project dialog box, select Business Intelligence Projects from the Project Types
section, and then select Integration Services Project from the Templates section. Enter a
name and location for your project and then click OK to create the SSIS project and a solu-
tion to contain it.

Figure 15-1 BIDS project templates

After you complete the New Project dialog box and click OK, you’ll see the contents of the
project created from the SSIS project template. Figure 15-2 shows a new SSIS project in
Visual Studio—this is the end product of the project template selected in the New Project
dialog box.

Figure 15-2 SSIS project template in Visual Studio

In Visual Studio, a project is a collection of files and settings—for an SSIS project, these are
generally .dtsx files for the packages in the project and settings related to debugging and
deployment. A solution in Visual Studio is a collection of projects and settings. When you
create a BI project using the SSIS template in BIDS, you can add items that are related to
SSIS development to your solution. These items include new data sources, data source views
(DSVs), regular SSIS packages, or SSIS packages that are created after running the SQL Server
Import And Export Wizard.

Viewing an SSIS Project in Solution Explorer


When you’re working in Visual Studio, you always have a solution that contains one or more
projects, and Visual Studio shows this structure in the Solution Explorer window. Figure 15-3
shows the Solution Explorer window for a solution containing a new SSIS project.

Figure 15-3 Solution Explorer, showing a new SSIS project

Tip If you look in Solution Explorer and see only your project with no solution displayed, don’t
worry. By default, Visual Studio displays the solution node in the Solution Explorer window only if
there is more than one project in the solution. If you want to display the solution node regardless
of the number of projects in your solution, you can open the Options dialog box from the Tools
menu in Visual Studio. Select the Projects And Solutions page in the list on the left and choose
the Always Show Solution check box.

Within the Solution Explorer window, each SSIS project contains the same four folders: Data
Sources, Data Source Views, SSIS Packages, and Miscellaneous. Although data sources and
data source views are vital components in SSAS projects, they’re used much less often when
working with SSIS. Data sources allow connection properties to be shared between pack-
ages in an SSIS project, but they’re strictly a Visual Studio tool and are unrelated to the SSIS
runtime. Because of this, any packages that rely on data sources to set connection properties
during development need to be updated to use another mechanism (such as package con-
figurations) during deployment.

Tip Because of the restrictions just defined, we usually don’t use project-level data sources. We
prefer to define connections within each package and manage their properties through package
configurations.

The SSIS Packages folder contains all the packages in the project; to add a new package,
right-click on the folder and select New SSIS Package from the shortcut menu. You can also
add an existing package to the project, import a DTS package, and more, all from the same
shortcut menu. The Miscellaneous folder is used to store any other files that are related to
the packages in the project. Often these are XML configuration files, sample data for use dur-
ing development, or documentation related to the project, but any type of file can be added
by right-clicking on the project (oddly enough, not by clicking on the Miscellaneous folder)
and selecting Add Existing Item from the shortcut menu.

Data Sources and Data Source Views


Data sources and data source views in SSIS function similarly to the way they work in
SSAS—that is, a data source creates a global connection, and a data source view allows
you to create abstractions (similar to views) using the source data. As in SQL Server
2008 Analysis Services (SSAS), in SSIS DSVs are most useful when you do not have per-
mission to create abstractions directly in the source data. It’s important that you under-
stand that changes in the underlying metadata of the source will not be automatically
reflected in the DSV—it must be manually refreshed. DSVs are not deployed when
SSIS packages are deployed. Also, DSVs can be used only from the Sources, Lookups,
or Destination components. Although there is some usefulness to both data source
and DSV objects, we generally prefer to use package configurations (which will be cov-
ered in Chapter 18, “Deploying and Managing Solutions in Microsoft SQL Server 2008
Integration Services”) because we find configurations to provide us with a level of flex-
ibility that makes package maintenance and deployment simpler.

If you have experience building .NET Framework applications in Visual Studio, you’ll probably
notice that SSIS projects do not have all the same capabilities as C# or Visual Basic projects.
For example, files are not sorted alphabetically—instead, your packages are sorted in the
order in which they are added to the project. You can re-sort them by manually editing the
.dtproj file for the project, but there is no support for this built into Visual Studio itself. You
can, however, use the BIDS Helper tool available from CodePlex to add package sorting func-
tionality to your SSIS projects.

Also, you cannot add your own folders and subfolders to an SSIS project; the four folders
described earlier and shown in Figure 15-3 are all you get. Still, despite these differences,
working with SSIS in Visual Studio should be a productive and familiar experience all around.

Using the SSIS Package Designers


Once you’ve had a chance to look at the resources available in the Solution Explorer window,
the next logical step is to move on to the package designers themselves. The three primary
designers (control flow, data flow, and event handler) and a viewer (Package Explorer) for SSIS
packages are accessed on the tabs shown in Figure 15-4.

Figure 15-4 Tabs for the SSIS package designers

Each tab opens a different designer:

■■ Control flow designer The Control Flow tab opens the control flow designer, which
presents a graphical designer surface where you will build the execution logic of the
package using tasks, containers, and precedence constraints.
■■ Data flow designer The Data Flow tab opens the data flow designer, which presents
a graphical designer surface for each data flow task in your package. This is where you
will build the ETL logic of the package using source components, destination compo-
nents, and transformation components.
■■ Event handler designer The Event Handlers tab opens the event handler designer,
which presents a graphical designer surface where you can build custom control flows
to be executed when events are fired for tasks and containers within the package, or
for the package itself.

Note Each of the three designers includes a Connection Managers window. As you saw in
Chapter 14, connection managers are package components that are shared between all parts of
the package that need to use the external resources managed by the connection managers. This
is represented inside the Visual Studio designers by the inclusion of the Connection Managers
window on every designer surface where you might need them.

Unlike the Control Flow, Data Flow, and Event Handlers tabs, the Package Explorer tab
doesn’t open a graphical designer surface. Instead, it gives you a tree view of the various
components that make up the package, including tasks, containers, precedence constraints,
connection managers, data flow components, and variables, as shown in Figure 15-5.

Unlike the package designer tabs, the Package Explorer tab is not used to build packages.
Instead, it presents a single view where the entire structure of the package can be seen in
one place. This might not seem important right now, but SSIS packages can become pretty
complex, and their hierarchical nature can make them difficult to explore through the other
windows. For example, tasks can be inside containers, containers can be inside other con-
tainers, and there can be many Data Flow tasks within a package, each of which can contain
many components. Also, the event handler designer does not provide any way to see in one
place what event handlers are defined for what objects within the package, which is a limita-
tion that can easily hide core package functionality from developers who are not familiar
with the package’s design. The Package Explorer gives SSIS developers a single place to look
to see everything within the package, displayed in a single tree that represents the package’s
hierarchy.

Figure 15-5 Package Explorer tab in the SSIS designer

Although you cannot build a package by using the Package Explorer, it can still be a valu-
able jumping off point for editing an existing package. Many objects, such as tasks, prece-
dence constraints, and connection managers can be edited from directly within the Package
Explorer tab—just double-click on the object in the package explorer tree and the editor
dialog box for that component will appear. This holds true for Data Flow tasks as well; if you
double-click on a Data Flow task icon within the package explorer tree, the designer surface
for that Data Flow task will open.
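Because the Package Explorer tree simply reflects the package’s container hierarchy, you can produce a comparable view from code. The following C# sketch (the package path is hypothetical, and a reference to Microsoft.SqlServer.ManagedDTS is assumed) walks a package recursively and prints each container and task name with indentation.

using System;
using Microsoft.SqlServer.Dts.Runtime;

class PackageWalker
{
    static void Main()
    {
        Application app = new Application();
        Package pkg = app.LoadPackage(@"C:\ETL\LoadDimCustomer.dtsx", null);
        Walk(pkg, 0);
    }

    // Print this container, then recurse into any child executables it holds.
    static void Walk(DtsContainer container, int depth)
    {
        Console.WriteLine(new string(' ', depth * 2) + container.Name);

        IDTSSequence sequence = container as IDTSSequence;
        if (sequence != null)
        {
            foreach (Executable child in sequence.Executables)
                Walk((DtsContainer)child, depth + 1);
        }
    }
}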

Working with the SSIS Toolbox


When you’re working with the control flow, data flow, or event handler designers, Visual
Studio displays a Toolbox window that contains the components that you use to build your
package. The components displayed in the Toolbox are context sensitive: when you’re build-
ing a data flow, it contains transformations, source, and destination components; when you’re
building a control flow or event handler, the Toolbox contains tasks and containers, as shown
in Figure 15-6.

The Toolbox works the same way for SSIS projects as it does for other Visual Studio projects
with graphical designers (such as Windows Forms and Web Forms projects)—you simply drag
the components to the designer surface to implement the logic your package requires.

Figure 15-6 Control Flow Toolbox

Notice that when you try dragging items (both tasks and components) to the designer sur-
face, you receive immediate feedback about the status of the item as soon as you drop it
onto the designer surface. If a task or component requires additional configuration informa-
tion to execute successfully, either a red icon with a white X or a yellow triangle with a black
exclamation mark appears on the task or component (rectangle) itself. If you pass your mouse
over the task or component in the design window, a tooltip appears with more information
about what you must do to fix the error. An example is shown in Figure 15-7.

Figure 15-7 Design-time component error

It’s also possible that the design environment will display a pop-up dialog box with an error
message when you attempt to open, configure, or validate the default configuration of a
task. It’s important that you understand that the design environment does extensive design-
time validation. This is to minimize run-time package execution errors.
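Similar validation can be triggered from code without executing the package, which can be handy in automated build or smoke-test scripts. The sketch below is illustrative only; the package path is hypothetical.

using System;
using Microsoft.SqlServer.Dts.Runtime;

class ValidatePackage
{
    static void Main()
    {
        Application app = new Application();
        Package pkg = app.LoadPackage(@"C:\ETL\LoadDimCustomer.dtsx", null);

        // Validate only; nothing is executed.
        DTSExecResult result = pkg.Validate(pkg.Connections, pkg.Variables, null, null);
        Console.WriteLine("Validation result: " + result);

        foreach (DtsWarning warning in pkg.Warnings)
            Console.WriteLine("Warning: " + warning.Description);
        foreach (DtsError error in pkg.Errors)
            Console.WriteLine("Error: " + error.Description);
    }
}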

Maintenance Plan Tasks


If you refer to Figure 15-6, you’ll notice that there is a second group of tasks at the bot-
tom, labeled Maintenance Plan Tasks. Database maintenance plans in SQL Server 2008
are implemented as SSIS packages, and there are specialized tasks included in SSIS to
support them, such as the Shrink Database task, the Back Up Database task, and the
Update Statistics task. Although SSIS developers can include any of these tasks in their
packages, they generally do not apply to business intelligence projects, and as such we
will not be covering them in this book.

After you create a package in BIDS, you can execute it by pressing F5 or selecting Start
Debugging from the Debug menu. When you run an SSIS package, an additional tab (the
Progress tab) becomes available for the particular SSIS package that you’re working with. It
shows you detailed execution information for each task included in the package. Also, the
control flow designer surface colors the tasks (rectangles) to indicate execution progress and
status. Green indicates success, yellow indicates in progress, and red indicates failure. Shown
in Figure 15-8 is a sample Progress tab.

Figure 15-8 SSIS includes a package execution Progress tab.

You can also set breakpoints on tasks and use Visual Studio debugging techniques to halt
and examine execution at particular points of package execution. These include viewing the
value of executing variables in the BIDS debugging windows—that is, locals, watch, and so
on. We’ll cover common debugging scenarios related to SSIS packages used in BI scenarios in
Chapter 16, “Advanced Features in Microsoft SQL Server 2008 Integration Services.”

Choosing from the SSIS Menu


One of the tools that the SSIS project template in BIDS adds to Visual Studio is an SSIS menu,
which is added to the main Visual Studio menu bar, as shown in Figure 15-9.

Figure 15-9 SSIS menu

The SSIS menu is a centralized way to access many of the SSIS-specific tools and windows.
As with any Microsoft product, there are many ways to access the SSIS-specific functionality
available in BIDS. We most often use the technique of right-clicking on the designer surface
and Solution Explorer items. Doing this opens shortcut menus that contain subsets of menu
options. As we work our way through the SSIS functionality, we’ll cover all the items listed on
the SSIS menu.

At this point, we’ve looked at the major components—windows, menus, and designers—that
SSIS adds to Visual Studio. In the next two sections, we take a closer look at the control flow
and data flow designers. In the section after that, we start drilling down into the details of
using these tools to build real-world SSIS packages.

Note One of the goals of this book is to move beyond the information that is available in SQL
Server Books Online and to give you more information, more depth, and more of what you
need to succeed with real-world SQL Server BI projects. Because of this, we’re not going to try
to reproduce or replace the excellent SSIS tutorial that is included with SQL Server. Books Online
includes a detailed five-lesson tutorial that has step-by-step instructions for building and then
extending a simple but real-world SSIS package. Search SQL Server Books Online for the topic,
“Tutorial: Creating a Simple ETL Package,” and try it out now.

Connection Managers
We introduced connection managers in Chapter 14 as gateways through which SSIS pack-
ages communicate with the outside world. As such, connection managers are a mechanism
for location independence for the package. In this section, we’re going to revisit connection
managers, changing the focus from the SSIS architecture to using connection managers
within Visual Studio.

Adding connection managers to your package is a straightforward process—simply right-click
on the Connection Managers window at the bottom of the designer surface and then
select one of the New options from the shortcut menu. Also, most of the tasks and data flow
components that use connection managers include a New Connection option in the drop-
down list of connection managers so that if you forget to create the connection manager
before you create the component that needs to use it, you can create it on the fly without
needing to interrupt your task.

Standard Database Connection Managers


SSIS includes three primary connection managers to connect to database data: the ADO.NET
connection manager, OLE DB connection manager, and ODBC connection manager. When
you’re choosing among them, the primary factor is usually what drivers (which are also called
providers) are available. These connection managers all rely on existing drivers. For example,
to connect to an Oracle database through an OLE DB connection manager, you must first
install the Oracle client connectivity components. SSIS doesn’t re-invent the existing client
software stack to connect to databases. The OLE DB connection manager can be used to
connect to file-based databases as well as server-based databases such as SQL Server and
Oracle.
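Connection managers can also be created from code rather than from the Connection Managers window. The following C# sketch is an illustrative example only; the server, database, and connection manager names are hypothetical, and SQLNCLI10 refers to the SQL Server 2008 Native Client OLE DB provider.

using Microsoft.SqlServer.Dts.Runtime;

class AddConnection
{
    static void Main()
    {
        Package pkg = new Package();

        // "OLEDB" is the creation name for the OLE DB connection manager type.
        ConnectionManager cm = pkg.Connections.Add("OLEDB");
        cm.Name = "SourceDB";
        cm.ConnectionString =
            "Provider=SQLNCLI10;Data Source=localhost;" +
            "Initial Catalog=AdventureWorksDW2008;Integrated Security=SSPI;";

        new Application().SaveToXml(@"C:\ETL\NewPackage.dtsx", pkg, null);
    }
}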

Note Other connection managers are available for SSIS as a download for Enterprise edition
customers after SQL Server 2008 RTMs. Additional information can be found at http://ssis.wik.is/
SSIS_2008_Connectivity_Options. If you’re looking for more details on SSIS connectivity options,
the SSIS product group created a Connectivity Wiki, where members of the product group, along
with third-party partners and vendors, post information related to SSIS connectivity. You can find
the wiki at http://ssis.wik.is. CodePlex also contains a sample that shows you how to program-
matically create a custom connection manager at http://www.codeplex.com/MSFTISProdSamples.

Other Types of Connection Managers


One consideration related to the Raw File connection manager is that it requires input in a
specific format. The regular file connection managers (File and Flat File) are
straightforward.

Also, the SSAS connection manager is simple to use—just provide a valid connection string to
a particular SSAS instance.

To see a complete list of available connection managers, right-click the Connection Managers
window (at the bottom of the control flow, data flow, or event handler designer surface) and
then click New Connection. New to SQL Server 2008 is the Cache connection manager. Used
with the Cache Transform, this connection manager is used to configure the behavior of the
Lookup transformation’s cache—in particular, it can be used to improve the performance of
lookups in the data flow.

Control Flow
In this section, we take a closer look at some of the core tasks used in BI ETL scenarios. As
we do this, we’ll describe different capabilities in the platform. We won’t cover the details of
every task in depth, because some tasks are virtually self-documenting and we want to focus
on building SSIS packages that facilitate BI ETL. So we’ll work to build a foundation in par-
ticular task capabilities in this section. To get started, we’ll work with a package that is part
of the samples that are available for download from www.codeplex.com/MSFTISProdSamples.
When you unzip the file, it creates a folder named MSFTISProdSamples-15927. Navigate
to the folder named Katmai_August2008_RTM\Package Samples\ExecuteProcess Sample.
The solution is named ExecuteProcessPackage Sample, and it contains a package called
UsingExecuteProcess.dtsx.

To see this sample, download and install the SQL2008.Integration_Services.Samples.xNN.msi
file from CodePlex. (xNN represents the type of operating system you’re using—x64 or
x86.) By default, the samples will be installed to C:\Program Files\Microsoft SQL Server\100\
Samples\. Open the package in BIDS by clicking on the ExecuteProcess.sln file located in
the Integration Services\Package Samples\ExecuteProcess Sample folder in your installation
folder. Then double-click on the UsingExecuteProcess.dtsx file in Solution Explorer. After you
do that, the control flow will display the configured tasks for this package and the Toolbox
window will list the control flow tasks, as shown in Figure 15-10.

Note As of the writing of this book, the sample package was written for SSIS 2005, so when you
open it, the wizard that upgrades the package to SSIS 2008 starts. We simply upgraded the sample
package using the default wizard settings for our example.

Figure 15-10 Control flow designer surface and Connection Managers window

Note that at the bottom of the control flow designer surface, the Connection Managers win-
dow shows that this package has four associated defined connections. You’ll recall from the
previous chapter that each represents a connection to some sort of data repository. We’ll
look at connection manager properties in more detail in a later section of this chapter as well.

Our example includes flags on the Execute Process task and two of the four connection
managers. These flags are indicated by small pink triangles in the upper left of the graphi-
cal shapes representing these items. These flags are part of the functionality that the free
BIDS Helper tool adds to SSIS packages. In this example, the indicator shows that the tasks or
connections have expressions associated with them. We introduced the idea of SSIS expres-
sions in Chapter 14. You’ll recall that they are formulas whose resultant values are part of
a component’s configuration. We’ll take a closer look at SSIS expressions in the last section
(“Expressions”) of this chapter.

Note What is BIDS Helper? Although BIDS Helper is not an official part of SSIS, you can down-
load it from CodePlex. This powerful, free utility adds functionality and usability to BIDS by
adding many useful features to both SSAS and SSIS. You can get it at http://www.codeplex.com/
bidshelper.

Control Flow Tasks


The sample package shown in Figure 15-10 contains five control flow tasks: three Execute
SQL tasks, an Execute Process task, and a Data Flow task. It also contains a Foreach Loop
container. The loop container contains a Script task. These three task types and the Execute
Package task and Script task are the five key control flow tasks that we most frequently use in
SSIS packages that perform ETL for BI solutions.

As we examine the commonly used tasks, we’ll also discuss the new Data Profiling task in
detail. In this particular sample, the task items are using the default task names and there are
no annotations. This is not a best practice! As we showed in the previous chapter, one of the
reasons to use SSIS, rather than manual scripts or code, is because of the possibility of aug-
menting the designer surface with intelligent names and annotations—you can think of this
as similar to the discipline of commenting code.

So what does this package do exactly? Because of the lack of visible documentation, we’ll
open each of the tasks and examine the configurable properties. We’ll start with the first
Execute SQL task (labeled Execute SQL Task 2). Although you can configure some of the task’s
properties by right-clicking and then working with the Properties window, you’ll probably
prefer to see the more complete view. To take a more detailed look, right-click the task and
then click Edit. The dialog box shown in Figure 15-11 opens.

Figure 15-11 Execute SQL Task Editor dialog box



This dialog box contains four property pages to configure: General, Parameter Mapping,
Result Set, and Expressions. Most of the properties are self-documenting—note the brief
description shown on the bottom right of the dialog box. In the General section, you config-
ure the connection information for the computer running SQL Server where you want to exe-
cute the SQL statement. The query can originate from a file, direct input, or a variable. You
can also build a query using the included visual query builder. If your query uses parameters,
and most will, you’ll use the Parameter Mapping page to configure the parameters.
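The settings you see in the editor correspond to properties on the underlying task, so the same configuration can be expressed in code through the TaskHost that wraps the task. The sketch below is a hypothetical example (the connection string, task name, and SQL statement are ours, not part of the sample package).

using Microsoft.SqlServer.Dts.Runtime;

class AddExecuteSqlTask
{
    static void Main()
    {
        Package pkg = new Package();

        ConnectionManager cm = pkg.Connections.Add("OLEDB");
        cm.Name = "SourceDB";
        cm.ConnectionString =
            "Provider=SQLNCLI10;Data Source=localhost;" +
            "Initial Catalog=AdventureWorksDW2008;Integrated Security=SSPI;";

        // "STOCK:SQLTask" is the moniker for the Execute SQL task.
        TaskHost task = (TaskHost)pkg.Executables.Add("STOCK:SQLTask");
        task.Name = "Truncate Staging Table";

        // Task-specific settings are reached through the TaskHost property bag.
        task.Properties["Connection"].SetValue(task, cm.ID);
        task.Properties["SqlStatementSource"].SetValue(task,
            "TRUNCATE TABLE dbo.StageCustomer");
    }
}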

The next task in the control flow is another Execute SQL task. If you examine that task’s prop-
erties, you’ll see that it executes a stored procedure called CreateProcessExecuteDest. Next
you’ll see that there is a small fx icon next to the green precedence constraint arrow immedi-
ately below this task. This icon indicates that the constraint includes an expression. We’ll take
a closer look at SSIS expressions in the last section of this chapter.

Next is an Execute Process task. This task allows you to add an executable process to your
control flow. Similar to the Execute SQL task, right-clicking the task and selecting Edit opens
a multipage dialog box where you can view and configure the available properties, as seen
in Figure 15-12. Note that this task causes an executable file named expand.exe to run. Also,
note that there’s a small pink triangle in the upper left corner of the task box. This indicates
that an SSIS expression is associated with the execution of this task. To see this expression,
click on the Expressions page of the Execute Process Task Editor dialog box. As mentioned,
the BIDS Helper utility adds this indicator—SSIS does not include visual indicators on tasks for
the use of expressions.

Figure 15-12 Execute Process Task Editor dialog box

So far drilling down into our three tasks has followed a very similar path; however, after you
click Edit on the fourth task, the Data Flow task, something entirely new happens. You’re
taken to a different interface in SSIS—the data flow designer. Why is this?

As mentioned in the previous chapter, starting with SSIS in SQL Server 2005, Microsoft added
a designer to develop data flows. Although it might seem confusing at first, this will make
perfect sense as you work with SSIS more and more. The reason is that the majority of the
work is often performed by this one task. In DTS in SQL Server 2000, the data flow was not
separated. This lack of separation resulted in packages that were much more difficult to
understand, debug, and maintain. We’ll take a closer look at the data flow designer shortly;
we just wanted to point out this anomaly early in our discussion because we’ve seen this con-
fuse new SSIS developers.

We mentioned that we frequently use the Execute Package task and the Script task in BI
scenarios. As you might have guessed by now, the Execute Package task allows you to trigger
the execution of a child package from a parent package. Why this type of design is advanta-
geous will become more obvious after we discuss precedence constraints in the next section.
Before we do that, we’d like to mention a significant enhancement to a frequently used
task—the Script task.

New to SQL Server 2008 is the ability to write SSIS scripts in either C# or Visual Basic .NET using
Visual Studio Tools for Applications (VSTA). In previous versions of SSIS, scripts were limited to
Visual Basic .NET and ran in the older Visual Studio for Applications (VSA) environment. We
devote Chapter 19, “Extending and Integrating
SQL Server 2008 Integration Services,” to scripting and extending SSIS, and we’ll have more
complete coverage and examples of using this task there.
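To give you a feel for the C# option, here is a small, hypothetical Main method for the ScriptMain class that VSTA generates inside a Script task. It assumes two package variables, User::SourceFolder and User::FileCount, which would need to be listed in the task’s ReadOnlyVariables and ReadWriteVariables properties; the folder and file pattern are illustrative only.

public void Main()
{
    // Read an input variable, do some work, and write a result variable.
    string folder = Dts.Variables["User::SourceFolder"].Value.ToString();
    int fileCount = System.IO.Directory.GetFiles(folder, "*.csv").Length;
    Dts.Variables["User::FileCount"].Value = fileCount;

    // Raise an informational event; it appears in logging and on the Progress tab.
    bool fireAgain = true;
    Dts.Events.FireInformation(0, "Count Files",
        fileCount + " file(s) found in " + folder, string.Empty, 0, ref fireAgain);

    // ScriptResults is the enum generated in the script project.
    Dts.TaskResult = (int)ScriptResults.Success;
}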

Integration Services contains many other useful tasks besides the key ones just covered. One
we’re particularly excited about is the Data Profiling task, which was developed by Microsoft
Research and is newly available in SQL Server 2008. For BI scenarios, we also use the Bulk
Insert, Analysis Services Execute DDL, Analysis Services Processing, and Data Mining Query
tasks. We’ll look at all these tasks in the next chapter as we dive into specific scenarios related
to BI, such as ETL for dimensions and fact table load for OLAP cubes and for data preparation
for load into data mining structures and models. Before we do that, though, let’s continue
our tour of the basics of SSIS packages by looking at other types of containers available on
the control flow designer surface.

Control Flow Containers


In addition to Task Host containers, the formal name for the objects (represented
as rectangles) that host individual tasks, the SSIS control flow contains additional container
types. You work with Task Host containers directly only when you’re developing custom
tasks. So, unless you’re doing that, they’ll be invisible to you.

These additional container types are the Sequence, Foreach Loop, and For Loop containers.
To group tasks together in sequence, you first drag a Sequence container onto the control
flow designer surface; next you drag your selected tasks onto the designer surface, dropping
them inside the container. The mechanics of the Foreach and For Loop containers are identi-
cal to the Sequence container.

Also note that there are many configurable properties associated with a Sequence container.
These properties relate to the execution of the contained tasks as a group. They allow you to
create transactions and set specific failure points. In addition to the Sequence, Foreach, and
For Loop containers, SSIS contains a generic group container.

You implement the group container by selecting tasks that you want to group together
and then selecting Group from the shortcut menu. Figure 15-13 shows a sample of tasks
grouped together. You can nest grouping containers inside of one another as well. Unlike the
Sequence container described earlier, the group container is a design convenience only. After
using it, you should rename the group container to a meaningful name and possibly anno-
tate the contents. You can collapse any type of grouping container so that the designer sur-
face is more easily viewable.

Another way to view large packages is to use the four-headed arrow that appears on the
designer surface if the package contents are larger than the available designer surface. If you
click that arrow, a small pop-up window appears. From that window, you can quickly navigate
to the portion of the package you want to work with. An example of the pop-up window is
shown in Figure 15-13.

Figure 15-13 SSIS package navigational window

In addition to using grouping containers to make a package easier to understand by collapsing
sections of tasks, or to sequence (or loop) through tasks, another reason to group tasks is
to establish precedence constraints for multiple tasks. To cover that topic, we’ll next drill into
your options for configuring precedence constraints.

Before we do that, however, we’d be remiss if we didn’t also mention a couple of other rea-
sons to use containers. These include being able to manage properties for multiple tasks at
once, being able to limit variable scope to tasks that are part of a container, and being able
to quickly and easily disable portions of a package. Disabling a portion of an SSIS package is
done by right-clicking on the object (task or container) on the control flow or event handler
designer surface and then clicking the Disable option. Any disabled objects are grayed out on
the designer surface.

Precedence Constraints
As mentioned in Chapter 14, on the control flow designer surface you have three principal
options for configuring precedence between tasks. These three options include proceeding
to the next task after successful completion of the previous task, proceeding to the next task
after failure of the previous task, or proceeding to the next task on attempted completion of
the prior task. Attempted completion means that the flow proceeds to the next task whether
the previous task succeeds or fails. These options are shown on the designer workspace as col-
ored arrow-lines that connect the constrained tasks and mark them with the following colors:
Success (green), Failure (red) or Completion (blue).

Of course it’s common to have more than one precedence constraint associated with a par-
ticular task. The most typical situation is to configure both Success and Failure constraints for
key tasks. The following example uses two constraints with a single source task and a single
destination task; in this BI ETL scenario, both are Execute SQL tasks.

For example, after successfully completing an Execute SQL task, such as loading a dimen-
sion table, you might then want to configure a second Execute SQL task, such as loading a
fact table. However, you might also want to load neither the dimension nor fact table if the
dimension table task does not complete successfully. A first step in this scenario would be to
configure a failure action for the first Execute SQL task. For example, you can choose to send
an e-mail to an operator to notify that person that the attempted load has failed.

Note In the preceding scenario, you might also want to make the two Execute SQL tasks trans-
actional—that is, either both tasks execute successfully, or neither task executes. SSIS includes the
capability to define transactions and checkpoints. We’ll cover this functionality in Chapter 16.

To add a constraint, simply right-click on any task and then click Add Precedence Constraint.
You can add a constraint from the resulting dialog box by selecting another task to use as the
end point for the constraint. If you right-click the newly created constraint and choose Edit,
you’ll then see the dialog box in Figure 15-14. In addition to being able to specify the desired
logic for multiple constraints—that is, logical AND (all constraints must be true) or logical OR
(at least one constraint must be true)—you can also associate a custom expression with each
constraint. If you select a logical AND, the arrow appears as a solid line; if you select a logical
OR, the arrow appears as a dotted line.

In Figure 15-14 we’ve also added an expression to our example constraint. We go into greater
detail about the mechanics of expressions later in this chapter. At this point, we’ve config-
ured our constraint to be set for the Success execution condition or the value of the expres-
sion evaluating to True. Your options are Constraint, Expression, Expression And Constraint,
and Expression Or Constraint.

Figure 15-14 SSIS Precedence Constraint Editor

Figure 15-15 shows how this type of constraint is represented on the designer surface—by
a green dashed arrow and a small blue expression icon (fx). Because of the power of con-
straints, it’s important to document desired package behavior in sufficient detail early in your
BI project. We often find that new SSIS developers underuse the capabilities of constraints,
expressions, and group containers.

Figure 15-15 SSIS precedence constraint showing OR condition and an associated expression
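For completeness, the same evaluation options are exposed in the runtime object model. The following C# sketch is a hypothetical example (the tasks, the variable, and the expression are ours): it wires two Execute SQL tasks together so that the second runs only when the first succeeds and the expression is true, which corresponds to the Expression And Constraint option in the editor.

using Microsoft.SqlServer.Dts.Runtime;

class AddConstraint
{
    static void Main()
    {
        Package pkg = new Package();
        pkg.Variables.Add("RowCount", false, "User", 0);

        Executable loadDimension = pkg.Executables.Add("STOCK:SQLTask");
        Executable loadFact = pkg.Executables.Add("STOCK:SQLTask");

        // Run the fact load only if the dimension load succeeds and the expression is true.
        PrecedenceConstraint pc = pkg.PrecedenceConstraints.Add(loadDimension, loadFact);
        pc.Value = DTSExecResult.Success;
        pc.EvalOp = DTSPrecedenceEvalOp.ExpressionAndConstraint;
        pc.Expression = "@[User::RowCount] > 0";
        pc.LogicalAnd = true;   // all incoming constraints must evaluate to true
    }
}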

One final point about constraints and groups is that SSIS allows you to create fully transac-
tional packages. We’ll talk more about this in the specific context of star schema and OLAP
cube loading in BI projects in Chapter 17, “Microsoft SQL Server 2008 Integration Services
Packages in Business Intelligence Solutions.” Next we’ll take a closer look at the key Data
Flow task.

Data Flow
This section is an introduction to building a package’s data flow in BIDS, with information
on how to use sources, transformations, and destinations as the building blocks for the data
logic of a package. Common or important transformations are highlighted and described,
but this section does not provide a laundry list of all data flow components. Instead, this sec-
tion focuses on common concepts and real-world usage scenarios, and it refers you to SQL
Server Books Online for the complete list of all components.

As mentioned, it’s in no way a requirement that every SSIS package include one or more
Data Flow tasks. For an example of an SSIS package that does not include a data flow, see
the ProcessXMLData.sln solution, which is part of the CodePlex SSIS samples. It’s installed at
C:\Program Files\Microsoft SQL Server\100\Samples\Integration Services\Package Samples\
ProcessXMLData Sample by default. That being said, most of your packages will include at
least one Data Flow task, so we’ll take some time to look at how this type of task works.

To get started, we’ll continue to work with the UsingExecuteProcess.dtsx sample package that
we looked at earlier in this chapter. Note that it contains one Data Flow task. To see more,
right-click that task and then click Edit. Doing this opens the task’s contents on the data flow
designer surface. This is shown in Figure 15-16. Also, note that the contents of the Toolbox
now reflect items available for use on the Data Flow designer surface—sources, transforma-
tions, and destinations.

Figure 15-16 Sample data flow designer surface in SSIS



In this sample, the contents of the sample Data Flow task are simple. This task contains only a
single source, a transformation, and a single destination. In this sample, the transformation is
a Data Conversion transformation. We’ll see in Chapter 17 that in particular BI scenarios, data
flow configurations can sometimes be much more complex than this example. We’ll start by
taking a closer look at available source components.

Data Flow Source Components


As shown in Figure 15-17, SSIS data flow sources include the following six sources:

■■ ADO NET Source
■■ Excel Source
■■ Flat File Source
■■ OLE DB Source
■■ Raw File Source
■■ XML Source

Figure 15-17 Data flow sources in SSIS

The most frequently used sources in BI scenarios are OLE DB Source, ADO NET Source, and
Flat File Source. We have also occasionally used Excel Source, Raw File Source, and XML
Source.

Our sample package (UsingExecuteProcess.dtsx) uses an OLE DB source, so we’ll examine that
type of source in greater detail next. Right-click the OLE DB source and select Edit to open
the dialog box shown in Figure 15-18. Here you’ll associate a connection manager and source
query (or table or view selection) with the OLE DB source. This selection can also be based on
the results of a variable. Next you make any name changes to the available columns for out-
put, and finally you configure error output information at the level of each available column.
We’ll take a closer look at error handling later in Chapter 16.

Figure 15-18 OLE DB Source Editor

Most components (for example, sources, destinations, and transformations) in the data flow
designer have two edit modes: simple (which we covered earlier) and advanced. You can see
the advanced editor by right-clicking on a component and then clicking on Show Advanced
Editor. The advanced editing view allows you to take a closer look at assigned values for
metadata, as well as to set additional property values.

One advanced property we sometimes work with is found on the Component Properties
tab. On this tab, you have the ability to change the default value of True for the
ValidateExternalMetadata property to False. This turns off validation of metadata, such as
database structures, file paths, and so on. We sometimes find this helpful if we’re planning to
define these values later in our SSIS package development cycle—usually by associating the
values with variables.

Before we move to the destinations, we remind you that in the data flow designer, connec-
tions between sources, transformations, and destinations are called paths, not precedence
constraints as they were called in the control flow designer. This is an important distinction.
Even though both items are represented by red or green arrows on their respective designer
surfaces they, in fact, represent different functionality. If you right-click on the OLE DB source
and then click Add Path, you’ll see a dialog box that allows you to create a path (called a con-
nector) between a source and a transformation or destination.

Although working with sources is pretty self-explanatory, we want to take a minute to discuss
the specifics of the Raw File source and a common use case for it. The first point to
understand is that a raw file is created as the output of using a Raw File destination. Another
point to consider is that the Raw File source does not use a connection manager—you simply
point it to the file on the server. The Raw File source is particularly lightweight and efficient
because of its design.

The Raw File source does have some limitations, though—such as the fact that you can only
remove unused columns (you cannot add columns to it). A common business scenario for its
use is to capture data checkpoints, which can be used later if you need to recover and restart
a failed package. We’ll talk more about creating self-healing packages by using checkpoints
in Chapter 16.

Destination Components
You might be wondering why the destination components contain some different destina-
tions than are available in the source components. Note that in addition to the familiar (from
source components) ADO NET Destination, Excel Destination, Flat File Destination, OLE DB
Destination, and Raw File Destination, you can also select from additional, more specialized
destinations. However, there is no XML destination included. The ADO.NET destination is new
to SQL Server 2008. It contains an advanced option that allows you to configure the destina-
tion batch size as well. We manually set the batch size for very large data transfers so that we
can control the performance of an SSIS package more granularly.

The destination components are mostly optimized to work with various components of
SQL Server itself. They include the Data Mining Model Training, DataReader, Dimension
Processing, Partition Processing, SQL Server, and SQL Server Compact destinations. Also, you
have a Recordset destination available. These are shown in Figure 15-19.

Figure 15-19 Data flow destinations



At this point, we’ll just take a look at the ADO.NET destination, mostly for comparison to the
ADO.NET source component. As with most other types of SSIS data sources, using SSIS data
destination components generally requires that you associate them with a connection man-
ager (except in the case of the Raw File destination). A typical workflow first connects either
a green or red arrow to a data destination; it then configures a connection manager; and
finally, it configures any other properties that are needed to provide the business functional-
ity required.

In earlier chapters, we briefly looked at the destinations that relate directly to BI projects—
that is, Data Mining Model Training, Dimension Processing, and Partition Processing. In
the next chapter, when we look at example packages for BI scenarios, we’ll revisit these
BI-specific destinations. First, though, we’ll continue on our journey of exploring basic pack-
age design. To do that, we’ll look next at data flow transformations.

Transformation Components
Many transformations are available in Integration Services. SQL Server Books Online does a
good job providing a basic explanation of the 29 available transformations under the topic
“Integration Services Transformations,” so we won’t repeat that information here. Rather, we’ll
consider categories of business problems common in BI ETL scenarios and look at transfor-
mations that relate to these problems.

Before we start, we’ll remind you that because we’re working in the data flow designer, we
can connect to transformations from sources and other transformations. Transformations
can also be connected to other transformations and to destinations. This creates the path
or paths for our data flow or flows. These input or output connections are shown on the
designer surface by the now familiar green (for good rows) or red (for bad rows) arrows.

To help you to understand the capabilities available in SSIS as data transformations, we’ll
group the available transformations by function. For reference, see all available transforma-
tions in the Toolbox window shown in Figure 15-20.

We’ll start with those that relate to data quality—Audit, Character Map, Conditional Split,
Fuzzy Grouping, Fuzzy Lookup, Lookup, Percentage Sampling, Row Sampling, Row Count,
Term Extraction, and Term Lookup. Two considerations are related to this group of transfor-
mations. The performance of the Lookup transformation has been improved in SQL Server
2008. Also, the new Cache Transform transformation can be used to manage a cache for
lookups.

We’ll look next at the transformations that we most commonly use to prepare data for load-
ing into star schemas or data mining structures: Aggregate, Cache Transform, Copy Column,
Data Conversion, Derived Column, Export Column, Import Column, Merge, Merge Join,
Multicast, OLE DB Command, Pivot, Unpivot, Script Component, Slowly Changing Dimension,
Sort, and Union All. To do this, we’ll explore the Aggregate transformation in more detail.

Figure 15-20 Data flow transformations

Open the Calculated Columns sample package contained in the CodePlex SSIS samples men-
tioned earlier in this chapter. As with the other package, the package version as of the writ-
ing of this book is 2005, so the Upgrade Package Wizard starts when you open the package.
Click through the wizard to upgrade the package, and then double-click the package named
CalculatedColumns.dtsx to open it in the BIDS designer. Double-click the Data Flow task
named Calculate Values to open the contents of that task on the data flow designer.

Note the Aggregate transformation, which is named Sum Quantity And LineItemTotalCost. As
its name indicates, this transformation is used to perform column-level aggregation, which is
a common preparatory task in loading data into a star schema. After this transformation has
been connected to an incoming data flow, you edit this transformation to configure the type
of aggregation to perform. Each output column must either be aggregated or used as a Group By column. As with the source
and destination components, there is a simple and advanced editing view for most transfor-
mations. Figure 15-21 shows edit information for a sample Aggregate transformation.

Finally, we’ll review a transformation that relates to working with data mining structures—the
Data Mining Query. As we mentioned in Chapter 13, “Implementing Data Mining Structures,”
the Data Mining Query transformation allows you to configure a DMX prediction query
(using the PREDICTION JOIN syntax) as part of a data flow. It requires that all input columns
be presorted.

We’ll soon be looking at SSIS packages that use these various transformations in Data Flow
tasks that solve BI-related ETL issues. We’re not quite ready to do that yet, however, because
we have a few other things to look at in basic SSIS package construction. The next item is
data viewers.

Figure 15-21 An Aggregate transformation

Integration Services Data Viewers


A great feature for visually debugging your data flow pipeline is the data viewer capability in
SSIS. These viewers allow you to see the actual data before or after a transformation. You can
view the data in a graphical format (histogram, scatter plot, or column chart), or you can view
a sampling of the actual data in grid form as it flows through the pipeline you’ve created.

To create a data viewer, you right-click on a red or green path and then click Data Viewers.
Then select the type of data viewer you want to use and which data from the data flow (col-
umns) you want to view. The dialog box you use for doing this is shown in Figure 15-22.

After you select and configure data viewers, they appear as a tiny icon (grid with sunglasses)
next to the path where you’ve configured them. Data viewers produce results only inside
of BIDS. If you execute your package outside of BIDS, data viewer results won’t appear. To
see how they work, we’ve created a simple package using the relational sample database
AdventureWorks LT that uses the Aggregate transformation to aggregate some table data.
We’ve added a data viewer before and after the transformation.

Figure 15-22 Data viewers

When you run the package in BIDS, the package execution pauses at the data viewer. You can
then view or copy the results to another location and then resume execution of the package.
In the following example, we’ve halted execution before the transformation and we’re using
two data viewers—grid and column chart. To continue execution of the package, click the
small green triangle in the data viewer window. You can see the results of the data flow in
Figure 15-23.

Figure 15-23 Data viewers are visual data flow debuggers.



Variables
Although we’ve mentioned variables several times in this chapter, we haven’t yet considered
them closely. In this section, we examine variables in the context of Integration Services.
Like any other software development platform, Integration Services supports variables as a
mechanism for storing and sharing information. For developers coming from a traditional
programming language such as Microsoft Visual Basic or Visual C#, or for those who are
experienced with database programming in a language such as Transact-SQL, variables in
SSIS might seem a little strange, mostly because of the differences in the SSIS platform itself.
To declare variables in SSIS, you use the Variables window found in the visual designers.

Variables Window
To open the Variables window, select View, Other Windows, Variables from the menu bar, or
select SSIS, Variables. You can also display it by right-clicking in an empty area in the
package designer and choosing Variables. To define a new variable in a package, click the
Add Variable button (the first button on the left) on the toolbar in the Variables window, as
shown in Figure 15-24.

Figure 15-24 Adding a variable

The Variables window is another window added to Visual Studio with the SSIS project tem-
plate. It shows the variables defined within the package and provides the capability to add,
delete, or edit variables. It’s worth noting that by default the Variables window displays only
user variables that are visible (based on their scope, which we’ll cover later in this section) to
the currently selected task or container. As mentioned in Chapter 14, to view system variables
or to view all variables regardless of their scope, you can use the third and fourth buttons
from the left on the Variables window toolbar to toggle the visibility of these variables on
and off.

Note BIDS Helper adds a new button to the Variables window toolbar called Move/Copy
Variables To A New Scope. It is shown in Figure 15-24 as the second-to-last button from the left.
This button adds functionality to SSIS that allows you to edit the scope of the variable using a
pop-up dialog box called Move/Copy Variables.

Variable Properties
In most development platforms, a variable simply holds or references a value, and the vari-
able’s data type constrains what values are allowed. In SSIS, however, variables are more com-
plex objects. Although each variable has a data type and a value, each variable also has a set
of properties that control its behavior, as shown in Figure 15-25, which presents the proper-
ties for an SSIS variable.

Figure 15-25 Variable properties

As you can see from Figure 15-25, there are quite a few properties, not all of which can be set
by the developer. Here are some of the most important properties that each variable has:

■■ Description The Description property is essentially variable-level documentation. It
does not affect the variable’s function at all, but it can help make the package easier to
maintain.
■■ EvaluateAsExpression This Boolean property determines whether the Value prop-
erty is a literal supplied by the developer at design time (or set by a package com-
ponent at run time) or if it’s determined by the variable’s Expression property. If
EvaluateAsExpression is set to True, any value manually assigned to the Value property
for that variable is ignored, and the Expression property is used instead.
■■ Expression The Expression property contains an expression that, when the
EvaluateAsExpression property is set to True, is evaluated every time the variable’s value
is accessed. We’ll go into more detail about expressions in the next section, but for now
keep in mind that having variables based on expressions is a crucial technique that is
part of nearly every real-world SSIS package.
■■ Name This property sets the programmatic name by which the variable will be
accessed by other package components. SSIS variable names are always case sensitive.
Forgetting this fact is a common mistake made by developers coming from non–case
sensitive development environments such as Visual Basic or Transact-SQL. It’s also
important to remember that SSIS variable names are case sensitive even when being
accessed from programming languages that are not inherently case sensitive them-
selves, such as Visual Basic.
■■ Namespace All SSIS variables belong to a namespace, and developers can set the
namespace of a variable by setting this property, which serves simply to give addi-
tional context and identity to the variable name. By default, there are two namespaces:
all predefined variables that are supplied by the SSIS platform are in the System
namespace, and all user-defined variables are in the User namespace by default. Please
note that variable namespaces are also case sensitive.
■■ RaiseChangedEvent This Boolean property determines whether an event is fired
when the value of the variable changes. If the RaiseChangedEvent property is set to
True, the OnVariableValueChanged event for the package is fired every time the vari-
able’s value changes. You can then build an event handler for this event. To determine
which variable changed to cause the event to fire (because there can be any number
of variables within a package with this property set), you can check the value of the
VariableName system variable within the OnVariableValueChanged event handler.
■■ Scope Although the Scope property cannot be set in the Properties window, it’s a vital
property to understand and to set correctly when the variable is created. The Scope
property of a variable references a task, a container, or an event handler within the
package, or the package itself, and it identifies the portions of a package where the
variable can be used. For example, a variable defined at the scope of a Sequence con-
tainer can be accessed by any task within that container, but it cannot be accessed by
any other tasks in the package, while a variable defined at the package scope can be
accessed by any task in the package. The scope of a variable can be set only when the
variable is created. To specify a variable’s scope, click on the package, container, or task
and then click the Add Variable button shown in Figure 15-24. If a variable is created
at the wrong scope, the only way to change it is to delete and re-create the variable if
you are using SSIS out of the box. BIDS Helper includes a tool that allows you to easily
change variable scope.
■■ Value The Value property is self-explanatory; it’s the value assigned to the variable.
But keep in mind that if the variable’s EvaluateAsExpression property is set to True,
any value entered here is overwritten with the value to which the variable’s expression
evaluates. This is not always obvious because the Properties window allows you to enter
new values even when the variable’s EvaluateAsExpression property is set to True. It
simply immediately replaces the manually entered value with the expression’s output.
■■ ValueType The ValueType property specifies the data type for the variable. For a
complete list of SSIS data types, see the topic “Integration Services Data Types” in SQL
Server Books Online.

System Variables
In addition to the user variables created in a package by the package developer, SSIS pro-
vides a large set of system variables that can be used to gather information about the pack-
age at run time or to control package execution.

For example, it’s common to add custom auditing code to a package so that the package
writes to a database table information about its execution. In this scenario, the package
developer can use the PackageName system variable to log the package’s Name property,
the PackageID system variable to log the package’s unique identifier, the ExecutionInstanceID
system variable to log the GUID that identifies the executing instance of the package, and the
UserName system variable to log the user name for the Windows account that is executing
the package. For a complete list of SSIS system variables, see the topic “System Variables” in
SQL Server Books Online.
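
To make the auditing scenario concrete, the statement below is the kind of parameterized
INSERT you might place in an Execute SQL task that uses an OLE DB connection manager. The
audit table name is hypothetical, and the ? markers would be mapped, in order, to the
System::PackageName, System::PackageID, System::ExecutionInstanceID, and System::UserName
variables on the task's Parameter Mapping tab.

-- Hypothetical audit statement for an Execute SQL task (OLE DB connection).
-- The four ? markers map to PackageName, PackageID, ExecutionInstanceID,
-- and UserName from the System namespace.
INSERT INTO dbo.PackageAuditLog
    (PackageName, PackageID, ExecutionInstanceID, ExecutedBy, LoggedAt)
VALUES
    (?, ?, ?, ?, GETDATE());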

Expressions
As you’ve probably noticed, SSIS variables and expressions often go hand in hand.
Expressions can be used in many different parts of a package, including tasks, containers,
precedence constraints, and connection managers. These items support having expres-
sions applied to one of their properties. For example, a File connection manager can use an
expression to set its ConnectionString property.

Developers can add expressions on most—but not all—properties of most—but not all—
package items. Selecting the Expressions property in the Properties window and clicking on
the ellipsis button opens up an editor dialog box where the developer can select a property
and then build an expression for that property.

Expressions are incredibly powerful, but they can also be incredibly frustrating at times.
This is not because of their functionality, but because of their discoverability or lack thereof.
BIDS does not provide any way to tell where expressions are being used within a package,
except by manually expanding the Expressions collection for each object. As you can imagine,
this is less than ideal and can be the cause of some frustration when working with complex
packages.

Fortunately, help is available. As mentioned, BIDS Helper has two features designed to ease
this pain. One is the Expression And Configuration Highlighter, which places a visual indica-
tor on each task or connection manager that has a property set via an expression or a pack-
age configuration. The other is the Expressions List window, which displays a list of all object
properties in the package for which expressions have been configured. The Expressions List
window also allows developers to edit the expressions, so it’s more than just a read-only list.

Variables and Default Values Within a Package


Although we’ve already looked at many different aspects of variables in SSIS packages, so far
we haven’t really said much about actually using them. There are quite a few common sce-
narios for using variables:

■■ Capturing query results The Execute SQL task can execute SQL queries and stored
procedures that can return scalar values or tabular result sets. The editor for the
Execute SQL task has a Result Set tab where you can specify variables in which to store
the values returned by the query.
■■ Counting rows in the data flow The Row Count transformation has a VariableName
property; the variable specified here will have assigned to it the number of rows that
pass through the Row Count transformation when the Data Flow task executes. This
value can then be used later on in the package’s control flow.
■■ Dynamic queries The Execute SQL task has an SqlStatementSourceType property,
which can be set to Variable. Then the value of the variable identified by the task’s
SqlStatementSource property will be used for the query being executed. In addition,
the OLE DB source component has a SQL Command From Variable command type that
operates much the same way—you can store the SELECT statement in a variable so that
it can be easily updated. A common use for this technique is to have the WHERE clause
of the SELECT statement based on an expression that, in turn, uses other variables to
specify the filter criteria.
■■ Foreach Loop enumerators This container supports a number of different enumer-
ators—things that it can loop over. One of the built-in enumerators is the Foreach File
Enumerator. This will loop over all the files in a folder and assign the name of the cur-
rent file to the Value property of a variable specified by the package developer. In this
scenario, any number of expressions can be based on the variable that contains the file
name, and those expressions will always be current for the current file.

In scenarios like the ones just described, where a variable is updated by a package com-
ponent such as the Foreach Loop container and then used in expressions elsewhere in the
package, it’s important that the variable is configured with an appropriate default value. For
example, if a connection manager’s ConnectionString property is based on an expression that
uses a variable that will contain a file name at run time, the variable should be assigned a
default value that is a valid file path.

What’s appropriate for the default value depends on the purpose of the variable and the
value it’s designed to hold. If the variable is an integer that will hold a row count, –1 is gener-
ally a safe default value, both because it’s a valid number and because it’s unlikely to occur
during package execution, making it obvious that it’s the design-time default. If the vari-
able is a string that will hold a table name for use in an expression that will be used to build
a SELECT statement, the default value should generally be a valid table name in the target
database.
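
To make the dynamic query scenario a bit more concrete, here is the sort of statement that
such a string variable might hold at run time; in a real package the table name and the date
boundaries in the WHERE clause would typically be supplied by other variables through an
expression (the names and values below are illustrative only).

-- Example of a SELECT statement assembled into a string variable at run time
SELECT SalesOrderID, CustomerID, OrderDate, TotalDue
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20040101'
  AND OrderDate <  '20040201';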

There will certainly be many other situations where a default value must be selected. There
is no single solution that will fit every situation, but in most cases the default value should be
an example of the type of value that will be assigned to that variable during package execu-
tion. Just keep in mind that the variables will be used during package validation before the
package executes, and that will keep you on the right track.

There are many other ways to use variables in SSIS packages, but these common scenarios
probably are enough to show how flexible and powerful SSIS variables can be. In the chap-
ters ahead, we’ll look at more examples of using variables with specific package components,
but this should be enough to get you started.

Summary
In this chapter, we covered most of the basic package mechanics, looking at many of the
tools that BIDS provides to enable package development and at the scenarios in which they
can be used. We reviewed the tasks available in the control flow and the components in the
Data Flow task. We also looked at how precedence constraints are implemented, using both
constraints and expressions.

In Chapter 16, we look at some of the more advanced features in Integration Services, from
error handling, events and logging, and checkpoints and transactions to data profiling. You
will begin to see specific examples that relate to BI solutions.
Chapter 16
Advanced Features in Microsoft SQL Server 2008 Integration Services
In Chapter 15, “Creating Microsoft SQL Server 2008 Integration Services Packages with
Business Intelligence Development Studio,” we looked at the mechanics of working with
some of the major components that make up the Microsoft SQL Server 2008 Integration
Services (SSIS) platform. We covered the development of packages with Business Intelligence
Development Studio (BIDS), control flow, data flow, variables, expressions, and connection
managers. In this chapter, we work with some of the more advanced features available when
developing Integration Services packages, including error and event handling, logging,
debugging, and checkpoints and transactions. We also recommend some best practices
you should follow when designing Integration Services packages. Finally, we introduce data
profiling, a new feature in SQL Server 2008. The chapter progresses from the more general
activities involved in all Integration Services packages to information more specific to busi-
ness intelligence (BI), such as OLAP cube loading and maintenance and data mining structure
loading.

Error Handling in Integration Services


One compelling reason to use SSIS to perform ETL—rather than simply using Transact-SQL
scripts or custom-coded solutions—is its ease of general business logic configuration, par-
ticularly for error handling. Error-handling responses are configured differently depending on
where the error originates. Let’s start by looking at the location of most errors that you’ll wish
to trap and correct—namely, the data flow. To do this you’ll work at the level of components
on the data flow designer surface. For most sources and destinations you can edit the com-
ponent and you will see an Error Output page like the one shown in Figure 16-1. The RAW
File source and destination components do not allow you to configure an error output.

The example in Figure 16-1 is an ADO.NET data source that has been configured to provide
data from the AdventureWorksLT SalesLT.Customer table (all columns). The Error Output page
allows you to configure Error and Truncation actions at the column level. The default value is
Fail Component for column-level errors or truncations. It is important that you understand
what types of conditions constitute data flow errors. Here’s the definition from SQL Server
2008 Books Online:

An error indicates an unequivocal failure, and generates a NULL result. Such errors
can include data conversion errors or expression evaluation errors. For example,
an attempt to convert a string that contains alphabetical characters to a number
causes an error. Data conversions, expression evaluations, and assignments of
expression results to variables, properties, and data columns may fail because of
illegal casts and incompatible data types.

Figure 16-1 Column-level error handling in the data flow

Truncations occur when you attempt to put a string value into a column that is too short to
contain it. For example, inserting the eight-character string December into a column defined
as three characters would result in a truncation.
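
If you want to see the difference between these two failure conditions outside of SSIS, a
couple of lines of Transact-SQL illustrate it (the variable and values here are purely for
demonstration):

-- A conversion error: alphabetic text cannot become a number
-- SELECT CAST('December' AS int);   -- fails with a conversion error

-- A truncation: the value is longer than the target type allows
DECLARE @monthAbbrev char(3);
SET @monthAbbrev = 'December';       -- silently truncated to 'Dec'
SELECT @monthAbbrev AS TruncatedValue;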

You can change the value to either Ignore Failure or Redirect Row. If you choose the latter,
you must connect the error output from the component to the input of another component
so that any possible error or truncated rows are handled. The error output is the red path
arrow, and connecting it to another component allows these error (or truncated) rows from
the first component to flow to the second component.

In addition to understanding how to configure error handling in the data flow, it is also
important that you understand the default error-handling behavior of an SSIS package,
container, or task. At this level you will work with a couple of object-level property views
to configure a number of aspects of error handling, such as the number of allowable errors
(with a default of 1 for the package, container, or task), whether an object will execute suc-
cessfully (useful for debugging because you can set a task to fail), and more. Figure 16-2
shows some of the properties for a package that relate to error handling and execution:
FailPackageOnFailure, FailParentOnFailure, and MaximumErrorCount. The ForceExecutionResult
property (not pictured) allows you to force the object to succeed or fail.

Figure 16-2 Package-level default property settings

Containers and tasks have similar properties. Note the FailParentOnFailure property: The
default setting is False for all objects. Of course, changing this property to True is only mean-
ingful at a level lower than the entire package, meaning a container or task. If you do change
this value to True for a container or for a task, it will cause the parent object (the container
or package) to fail if its child object fails. We’ll talk more about task, container, and package
error handling (and recovery) later in this chapter, when we discuss checkpoints and transac-
tions. Before we do that, however, we’ll take a closer look at events and logging, because
you’ll often include responses to events that fire in your error-handling strategy.

Events, Logs, Debugging, and Transactions in SSIS


On the Event Handlers tab in BIDS, you can configure control flows that run in response to
particular types of events firing at the package, container, or task level. You can define event
handler (control) flow responses to 12 different types of events. When you define a response,
you select the entire package or a selected container or task and then select the type of
event you are interested in. We most commonly use these event handler types for error
responses: OnError and OnTaskFailed.

Another use of event handlers is to record package execution activity for compliance
(auditing) purposes. For auditing, we often use the event handlers OnExecStatusChanged,
OnProgress, and OnVariableValueChanged. We also sometimes want to trap and log events
during the early development phases of our SSIS projects. Using this technique, we can cap-
ture, view, and analyze activities that fire events in our prototype packages. This helps us to
better understand execution overhead during early production and even pilot testing. This
type of testing is, of course, most useful for packages that are used to update OLAP cubes
and data mining structures, rather than packages that perform one-time initial SSAS object
loads. Update packages are frequently reused—we typically run them daily. Insert packages
are run far less frequently.

Events for which you have configured a custom control flow will appear in bold in the list of
possible events to capture for the particular object (the package, container, or task). We’ve
configured a simple example (shown in Figure 16-3) using the OnPostExecute event for the
entire package.

Figure 16-3 Package-level event handlers

After you select the object in the Executable list, select the event handler that you want to
create a control flow for, and then double-click the designer surface to activate the event.
To configure it, drag control-flow tasks to the Event Handlers designer surface. This surface
functions similarly to the Control Flow designer surface. Continuing our example, we cre-
ated a very simple package with one Execute SQL task in the main control flow, and then
we added an event handler for the OnPostExecute event for the package. Our event handler
control flow contains one Execute SQL task. This task executes when the OnPostExecute event
fires from the main control flow. Expanding the Executable drop-down list on the Event
Handlers tab of BIDS shows both Execute SQL tasks, as seen in Figure 16-4. Note that you
may have to close and reopen the package in BIDS to refresh the view.

Figure 16-4 Event Handler Executables object hierarchy

As you’ll see in a minute, even if you don’t configure any type of formal package logging,
when you use BIDS to execute a simple sample package, you can already see the output of
the Event Handler control flow (after the event fires).

A particularly interesting system variable available in event handlers is the Propagate vari-
able. This variable is used to control the bubble-up behavior of events—the characteristic of
low-level SSIS tasks to fire events and then to send (bubble up) those events to their parent
containers. These containers can be SSIS containers, such as Foreach Loop containers, or even
the parent package itself.

In addition to the Propagate variable, SSIS adds several variables scoped to the OnError
event, including Cancel, ErrorCode, ErrorDescription, EventHandlerStartTime, LocaleID,
SourceDescription, SourceID, SourceName, and SourceParentGUID. These variables are quite useful
in packages where you wish to capture details about packages or package component errors.

Logging and Events


The Execution Results tab provides a kind of quick log of package execution in BIDS. If you
have configured any event handlers, as we did in the previous example with OnPostExecute
event for the package, this window includes event execution information about these events
as well. Figure 16-5 shows the logging that occurs during package execution.

Figure 16-5 The Execution Results tab includes information about fired event handlers.

You might also want to capture some or all of the package execution information to a per-
manent log location. SSIS logging is configured at the package level. To do this you simply
right-click the control flow designer surface and then click Logging. This will open a dialog
box that includes five built-in options for logging locations to capture activity around pack-
age execution. You can select from multiple log types on the Providers And Logs tab, as
shown in Figure 16-6. You can specify multiple logging providers for the same package, so
for example, you could choose to configure the package to log information to both SQL
Server and a text file.

Note When you configure a log provider for SQL Server, you must specify a connection man-
ager. A logging table named sysssislog will be created (if it doesn’t already exist) in the database
that the connection manager uses. It is created as a system table, so if you are looking for it in
SSMS, you must open the System Tables folder in Object Explorer. Under Integration Services
2005, the log table name was sysdtslog90.
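
Once logging to SQL Server is in place, you can review the captured events with a simple
query. The following is a minimal sketch that assumes the default dbo.sysssislog table
described in the note above and pulls back only error-related events:

-- Review the most recent error-related log entries
SELECT TOP (50)
    starttime,
    source,      -- the task, container, or package that raised the event
    event,       -- for example, OnError or OnTaskFailed
    message
FROM dbo.sysssislog
WHERE event IN ('OnError', 'OnTaskFailed')
ORDER BY starttime DESC;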

Figure 16-6 Log locations



After you configure your log locations on the Providers And Logs tab, you select one or more
event types to capture on the Details tab, shown in Figure 16-7. The Details tab also includes
an advanced view that allows you to select the columns of information you’d like to include
for each type of event selected. You can save the logging configuration information to an
XML file by choosing Save; you can load an existing logging configuration file with the Load
button.

Figure 16-7 Package-level events

After you execute a package in BIDS for which you’ve configured logging, you can view the
log results in the Log Events window inside of BIDS. You open this window by selecting Log
Events on the SSIS menu. Logging is used for two main reasons: first for tuning and debug-
ging packages during the early phases of your BI project, and later for meeting compliance
requirements after the SSIS packages have been deployed into production environments. Be
aware that logging all events for entire packages is quite resource-intensive and produces
very verbose results. In our simple example we’ve captured all events (except Diagnostic) for
an SSIS package with a single control flow task, which executes the simplest possible query
(Use Master). The Log Events window output from executing this simple package is shown in
Figure 16-8.

Figure 16-8 Log Events output window

The logging events you can capture and log vary by type of task in the control flow. The
most notable example of this occurs when you configure logging on a package that includes
a data flow task. You’ll notice that when you do this, the Configure SSIS Logs dialog box
contains more possible events to log than Figure 16-7 shows. The new types of events that
you can trap include several related to the activity in the data flow. Figure 16-9 shows this
expanded dialog box with several of the pipeline event types called out. Trapping activity
around the pipeline or data flow allows you to understand the overhead involved in the data
flow at a very granular level. This is important because, as we’ve said, the data flow is often
where SSIS package bottlenecks can occur.

Tip For a complete listing of the custom log entries on a task-by-task basis, see the SQL Server
Books Online topic “Custom Messages for Logging.”

Figure 16-9 Configure SSIS Logs dialog box

As with many other advanced capabilities of SSIS, we caution that you should plan to use just
enough logging in production situations. We often find either none or too much. Usually at
least a couple of event types should be logged, such as OnError or OnTaskFailed. If you find
yourself on the other side of the logging debate—that is, logging a large amount of infor-
mation—you should be aware that such expansive logging can add unnecessary processing
overhead to your SSIS server. For an extension to the logging capabilities included in SSIS,
see the CodePlex project at http://www.codeplex.com/DTLoggedExec, which was written by
Davide Mauri, one of this book’s authors. The tool is an enhancement to DTExec that gives
you a much more powerful logging capability.

Tip If you are logging your SSIS package execution to SQL Server and you have an SSRS instal-
lation available, you can download some SSRS report templates (RDL files) that will allow you
to view your SSIS package logs using the SSRS interface. These were originally developed for
Integration Services 2005, so you will need to update them to use the new log table name sys-
ssislog. These templates are freely downloadable from http://www.microsoft.com/downloads/
details.aspx?familyid=D81722CE-408C-4FB6-A429-2A7ECD62F674&displaylang=en.

Debugging Integration Services Packages


You have the ability to insert breakpoints visually into SSIS packages. You can insert break-
points into tasks in the control flow or into tasks in event handlers. You insert breakpoints by
right-clicking the package designer surface or an individual task or container, and then click-
ing Edit Breakpoints. Figure 16-10 shows the Set Breakpoints dialog box. Here you select the
event(s) and (optionally) the conditions under which the breakpoint will pause execution of
the package.

Figure 16-10 The Set Breakpoints dialog box



You set the condition by selecting one of the options from the drop-down menu in the Hit
Count Type column. After you have successfully configured a breakpoint, a red dot is placed
on the task that has a breakpoint associated with it.

As you execute an SSIS package that has defined breakpoints in BIDS, package execution
halts at the defined breakpoints and the common debugging windows become available. We
find the Locals, Watch, and Call Stack debugging windows to be most useful when debug-
ging SSIS packages.

Of course, you learned in Chapter 15 that you can also associate one or more data view-
ers with paths (arrows) in the data flow. You saw that the data viewers allow you to perform
a type of visual data flow debugging because they halt execution of data flow until you
manually continue it by clicking the green triangle on the data viewer surface. You may also
remember that in addition to being able to view the data in the data flow in various formats
(grid, histogram, and so on) you can also copy that data to another location for further analy-
sis and then continue package execution.

Checkpoints and Transactions


SSIS package checkpoints are a type of optional save point that allows a package to be
restarted from a point of failure. We have used checkpoints with SSIS packages that may
include occasional failures resulting from (for example) a component that connects to a pub-
lic Web service to retrieve data from the Internet. A checkpoint is literally an XML file that
contains information about which components ran successfully at a particular execution date
and time. You can configure checkpoints at the package level only. Checkpoints are often
used in conjunction with transactions. We’ll get into more detail about configuring package
transactions shortly.

The key properties available for packages that are associated with checkpoint and transaction
settings are CheckpointFileName, CheckpointUsage, and SaveCheckpoints. To enable check-
points for an SSIS package, you must first supply a value for the CheckpointFileName property.
You can hard-code this XML file name, but a better practice is to use
variables and expressions to dynamically create a unique file name each time the package is
executed (based on package names, execution date and time, and so on). Next you change
the default property value for CheckpointUsage from never to IfExists. Then you change the
SaveCheckpoints value from False to True.

The next step in the checkpoint configuration process is to change the default settings on
particular control flow tasks or containers to cause the package to fail immediately when a
key task or component fails. The default behavior of a package is to continue to attempt to
execute subsequent tasks and containers in a control flow if they are not connected to the
failing task by Success precedence constraints or they are connected to the failing task by
Failure or Completion precedence constraints. When using checkpoints, however, you want
the package to stop and create a checkpoint file as soon as a task fails.

To write a checkpoint you must change the default setting of False to True for
FailPackageOnFailure for any tasks or containers that you wish to involve in the checkpoint process.
This will cause the SSIS runtime to stop package execution as soon as the task or container
reports a failure, even if subsequent tasks are connected to the failing task with Completion
or Failure constraints. If you set the FailParentOnFailure property for a task or container to
True, this component can participate in a transaction, but no checkpoint will be written if the
particular component fails.

You may be wondering whether it is possible to create checkpoints in a data flow. The techni-
cal answer is “not by using SSIS checkpoints.” We do use the Raw File destination component
to create save points in a particular data flow. Remember that the Raw File is a particularly
fast and efficient storage mechanism. We use it to create a temporary storage location for
holding data that can be used for task or package restarts, or for sharing data between mul-
tiple packages.

Checkpoints are frequently used with manually defined transactions, so we’ll take a look at
how to do that next.

Tip If you are using SQL Server 2005 or 2008 Enterprise edition data sources in your package,
another potential method of implementing package restarts is to take advantage of the database
snapshot feature introduced in SQL Server 2005. If you are not familiar with this technique, read
the SQL Server Books Online topic “Creating a Database Snapshot.” You can use event handler
notifications in conjunction with transactions. By combining these two powerful features, you can
establish package rollback scenarios that allow you to revert to a previously saved version of your
source database. To do this, you configure a call to a specific database snapshot at a particular
package failure point.

Configuring Package Transactions


SSIS packages include the ability to define transactions at the package, container, or task
level. You configure two key property settings to enable transactions. The first is IsolationLevel,
which has the following settings available: Unspecified, Chaos, Read Uncommitted, Read
Committed, Repeatable Read, Serializable (this is the default and the most restrictive isolation
level), and Snapshot. For more information about the locking behavior that each of these
isolation levels produces, read the SQL Server Books Online topic “Isolation Levels in the
Database Engine.”

The second setting is TransactionOption, which has the following settings available:
Supported, Not Supported, and Required. The default setting is Supported, which means that
the item (package, container, or task) will join any existing transaction, but will not start a
unique transaction when it executes. Required means that a new transaction is originated
by the package, container, or task configured with that setting only if an existing higher-
level transaction does not already exist. For example, if a container is set to Required and no
transaction is started at the package level, invocation of the first task in that container starts
a new transaction. However, in the same scenario, if the container’s parent (in this case, the
package) had already started a transaction, the firing of the first task in the container causes
the container to join the existing transaction.

Tip Configuring checkpoints and transaction logic looks deceptively simple. You just change the
default configuration settings of a couple of properties, right? Well, yes—but real-world experi-
ence has taught us to validate our logic by first writing out the desired behavior (white-boarding)
and then testing the results after we’ve configured the packages. “Simpler is better” is a general
best practice for SSIS packages. Be sure that you have valid business justification before you
implement either of these advanced features, and make sure you test the failure response before
you deploy your package to production!

Transactions do add overhead to your package. SSIS will use the Microsoft Distributed Trans-
action Coordinator (MS-DTC) service to facilitate transactions if you use the
TransactionOption property in your packages. The MS-DTC service must be appropriately configured
and running for the SSIS transactions to work as expected. There is an exception to this if
you manually control the transactions. You can use Execute SQL tasks to issue your own
BEGIN TRANSACTION and COMMIT TRANSACTION commands to manually start and stop
the transactions. Be aware that, if you take this approach, you must set the
RetainSameConnection property of the connection manager to True. This is because the default behavior
in SSIS is to drop the connection after each task completes, so that connection pooling can
be used.
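
As a minimal sketch of the manual approach, the first and last Execute SQL tasks in the
sequence might contain nothing more than the following statements (all of the tasks involved
must share the same connection manager, with RetainSameConnection set to True):

-- Statement for the first Execute SQL task in the sequence
BEGIN TRANSACTION;

-- ...the intervening tasks do their work on the same connection...

-- Statement for the final Execute SQL task (an OnError handler
-- could issue ROLLBACK TRANSACTION instead)
COMMIT TRANSACTION;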

A package can use a single transaction or multiple transactions. We prefer a simpler
approach, and when configuring transactions we tend to break the package up into smaller
packages if we find a need for multiple transactions in a package. Each smaller package can
contain a single transaction. We’ll give specific examples related to solving BI-ETL solutions
around transactions in Chapter 17, “Microsoft SQL Server 2008 Integration Services Packages
in Business Intelligence Solutions.”

Tip During testing of transactions in SSIS, we find it useful to set the ForceExecutionResult prop-
erty value to Failure for tasks or containers. This is a quick and easy way to guarantee a failure
so that you can easily test your transaction, checkpoint, and recovery package logic. If you use
this technique, remember that you’ll also have to set the FailPackageOnFailure property value for
each task and container involved in the transaction to True.

As we close this section, we want to point out that you can't enable both checkpoints and
transactions at the package level. You can combine checkpoints with transactions on containers
within the package, just not with a package-level transaction.

Best Practices for Designing Integration Services Packages
When it comes to designing packages, you have a lot of choices. Learning by trial and error
is one way to find out the pitfalls and shortcuts in package design. Another way is to adopt
the best practices of experienced package developers. We recommend that you follow these
best practices:

■■ Favor more and simpler packages over fewer, complex packages. For BI projects, it’s not
uncommon to use one package for each dimension and fact table and sometimes even
one package per data source per dimension or fact table.
■■ Evaluate the overhead and complexity of performing advanced transformations while
also considering the quantity of data to be processed. Favor intermediate on-disk stor-
age over intensive in-memory processing. We find that disk storage is cheaper and
more effective than intensive, in-memory transformations. We often use the Raw File
source and destination components to facilitate quick and easy data storage for this
type of process.
■■ Evaluate the quality of source data early in the project; utilize error handling, debug-
ging, logging, data viewers, and the Data Profiling task to quickly evaluate the quality
of source data to more accurately predict the volume of work to create production
packages. Favor creating intermediate on-disk temporary storage areas and small pack-
ages that perform individual units of work, rather than large and complex packages,
because these simpler packages execute more efficiently, contain fewer bugs, and are
easier to maintain and debug. Just like any other code you write, simpler is better.
■■ If choosing to process large amounts of potentially bad data, favor preprocessing or
intermediate processing on excessively dirty data. Utilize logging, error handling, and
events to capture, redirect, and correct bad data. Utilize the “more-to-less” specific
methods of identifying and cleaning bad data and their associated transformations; for
example, lookup to fuzzy lookup. We’ll cover fuzzy lookups in Chapter 17.
■■ If using intensive and complex in-memory transforms, test with production levels of
data prior to deployment, and tune any identified bottlenecks. Tuning techniques
include reducing included data (for all transformation types), improving lookup perfor-
mance by configuring the cache object, writing more effective Transact-SQL queries for
the Execute SQL Query object, and more. Note that in SSIS 2008, you can use the cache
transformation to bring in data for lookups from OLE DB, ADO.NET, or flat-file sources.
(In SSIS 2005, only OLE DB sources were supported as a lookup reference dataset.)
■■ Utilize the self-healing capabilities of SSIS for mission-critical packages. These include
bubbling up events and errors and using checkpoints, database snapshots, and trans-
actions. For more information about database snapshots, see the SQL Server Books
Online topic “Database Snapshots.”

If you follow these practices when you design packages, you’ll be following practices that we
learned the hard way to observe when working with SSIS. As with any powerful tool, if used
correctly SSIS can significantly enhance your BI project by helping you get better-quality data
loaded into your OLAP cubes and data mining structures faster and more accurately.

Tip BIDS Helper includes a useful tool for helping you to understand SSIS package perfor-
mance—the SSIS Performance Visualization tool. It creates a graphical Gantt chart view of the
execution durations and dependencies for your SSIS packages.

Data Profiling
The control flow Data Profiling task relates to business problems that are particularly promi-
nent in BI projects: how to deal with huge quantities of data and what to do when this data
originates from disparate sources. Understanding source data quality in BI projects—when
scoping, early in prototyping, and during package development—is critical when estimating
the work involved in building the ETL processes to populate the OLAP cubes and data mining
structures. It’s common to underestimate the amount of work involved in cleaning the source
data before it is loaded into the SSAS destination structures.

The Data Profiling task helps you to understand the scope of the source-data cleanup
involved in your projects. Specifically, this cleanup involves deciding which methods to use to
clean up your data. Methods can include the use of advanced package transformations (such
as fuzzy logic) or more staging areas (relational tables) so that fewer in-memory transforma-
tions are necessary during the transformation processes. Other considerations include total
number of tasks in a single package, or overall package size.

To use the Data Profiling task, simply drag an instance of it from the Toolbox to the control flow
designer surface. You can then set up either a quick profile or a regular profile using the
configuration dialog boxes for the Data Profiling task. Figure 16-11 shows the Single Table
Quick Profile Form options. You can open it by right-clicking the Data Profiling task, choosing
Edit, and then clicking the Quick Profile button in the lower right of the resulting dialog box.
Seven profiling options are available.

For a more granular (advanced) property configuration, you can work in the Data Profiling
Task Editor dialog box. Figure 16-12 shows this dialog box and the available configurable
properties for the Column Null Ratio Profile Request profile type.

Figure 16-11 Data Profile task, Single Table Quick Profile Form dialog box

Figure 16-12 Data Profiling Task Editor dialog box



Before you start using the Data Profiling task, be aware of a few limitations. Currently it works
only with source data from SQL Server 2000 or later. You can work around this limitation by
staging data from other sources into SQL Server and performing the profiling on it there. We
expect this limitation to change with future updates to the task and the related viewer tool.
The Data Profiling task produces an XML-formatted string that you can save to a variable in
an SSIS package or to a file through a file connection manager. You can view the profile out-
put files using a new tool called Data Profile Viewer, which is located by default at %Program
Files%\Microsoft SQL Server\100\DTS\Binn\DataProfileViewer.exe. Data Profile Viewer is not
a stand-alone tool, and you must have SSIS installed on the computer to use it. Data Profile
Viewer cannot create profiles by itself—that must be done through an SSIS package. The
Data Profiling task can only use ADO.NET connection managers.

You might be wondering what each of the seven profiles you see in Figure 16-11 refers to.
Each profile provides a different view of the selected source data. The following list briefly
describes the functions of each type of profile:

■■ Column Null Ratio Profile Helps you to find unacceptably high numbers of missing
values (nulls) in source data of any type. Finding unexpectedly high numbers of nulls
could lead you to conclude that you may have to do more manual or more automatic
preprocessing on source data. Another possibility is that your source data extract logic
might be flawed (for example, you could be using a Transact-SQL statement with mul-
tiple JOIN clauses).
■■ Column Statistics Profile Returns information about the specific values in a numeric
or datetime column. In other words, it returns the minimum, maximum, average, and
standard deviation of the column values. As with the Column Null Ratio Profile, this
profile can help you detect the quality of the source data and identify invalid data.
■■ Column Value Distribution Profile This profile produces the overall count of distinct
values in a column, the list of each distinct value in a column, and the count of times
each distinct value appears. This can be useful for identifying bad data in a column,
such as finding more than 50 distinct values in a column that should contain only U.S.
state names.
■■ Column Length Distribution Profile As its name indicates, this profile type shows
information about the length of the data values contained in the column. It works for
character data types, and returns the minimum and maximum length of data in the
column. It also returns a list of each distinct length, along with the count of how many
column values have that length. Exceptions can indicate the presence of invalid data.
■■ Column Pattern Profile Generates a set of RegEx expressions that match the contents
of the column. It returns the RegEx expressions as well as the number of column val-
ues that the expression applies to. This is quite powerful and we are particularly eager
to use this profile to help us to identify bad source data. One business example that
occurs to us is to validate e-mail address pattern data.
■■ Candidate Key Profile Shows the uniqueness of data values as a percentage—for
example, 100 percent unique, 95 percent unique, and so on. It is designed to help you
identify potential key columns in your source data.
■■ Functional Dependency Profile Shows the strength of dependency of values in one
column to values in another—for example, associating cities and states. Mismatches
that are identified could help you pinpoint invalid data.

One additional, advanced type of profile is available: the Value Inclusion Profile Request. It
can be added only through the regular (full) dialog box (in other words, not the quick profile
dialog box). This type of profile is designed to help identify foreign key candidates in tables.
It does this by identifying the overlap in values from columns in two tables—the subset and
superset table or view in the advanced properties pane. In this same area, you can also set
the threshold levels for matching. See the SQL Server Books Online topic “Value Inclusion
Profile Request Options” for more information about the advanced property settings.

When you perform data exploration, you often set up the package to save the profile infor-
mation to a file. After you run a package that includes a Data Profiling task configured this
way, an XML file is written to the package’s configured file destination. You can use the new
Data Profile Viewer tool to examine the XML output file. Figure 16-13 shows an example
of an output file in the Data Profile Viewer tool. Again, this tool is located by default at
%Program Files%\Microsoft SQL Server\100\DTS\Binn\DataProfileViewer.exe. You can see
that we are looking at the output from the Column Value Distribution Profiles for the col-
umns from the Sales.Customer table in the AdventureWorks2008 sample database. The tool
shows the number of distinct column values and the frequency distribution for those values
numerically, graphically, and as a percentage of the total for the selected individual column
(TerritoryID).
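
If you want to sanity-check what the viewer reports, a hand-written query can approximate the
same distribution. The following sketch reproduces the distinct-value counts and percentages
for the TerritoryID column shown in Figure 16-13, assuming the AdventureWorks2008 sample
database:

-- Approximate the Column Value Distribution Profile for Sales.Customer.TerritoryID
SELECT
    TerritoryID,
    COUNT(*) AS ValueCount,
    CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS decimal(5, 2)) AS PercentOfTotal
FROM Sales.Customer
GROUP BY TerritoryID
ORDER BY ValueCount DESC;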

Tip Out of the box, the Data Profiling task only allows you to profile a single table. Here’s a great
blog entry about a pretty simple work-around that allows you to profile all of the tables in a
particular database: http://agilebi.com/cs/blogs/jwelch/archive/2008/03/11/using-the-data-profiling-
task-to-profile-all-the-tables-in-a-database.aspx.

Figure 16-13 Data Profile Viewer showing the Column Value Distribution Profiles for the Sales.Customer
sample table.

Summary
In this chapter we looked at some advanced aspects of working with SSIS packages, covering
features including error and event handling, logging, debugging, checkpoints, and imple-
menting transactions. We also took a detailed look at the new Data Profiling task. We’ve
really just begun to scratch the surface of the type of data evaluation that you’ll commonly
do at the beginning of every BI project. In Chapter 17, we apply all we’ve learned about SSIS
package development to very specific BI scenarios, such as ETL for dimension and fact-table
loading for OLAP cubes. We also present more tools and techniques focusing on applying
the SSIS tools to a data warehousing project.
Chapter 17
Microsoft SQL Server 2008 Integration Services Packages in Business Intelligence Solutions
In this chapter, we turn our attention to the specific implementation details for packages.
We’ll look at SSIS extract, transform, and load (ETL) as it applies to OLAP cubes and data min-
ing structures.

ETL for Business Intelligence


As mentioned earlier, you need to load two fundamentally different types of structures via
SSIS ETL processes in BI projects: OLAP cubes and data mining structures. Within each of
these types of SSAS objects you’ll find a further top-level subset of functionality: the ETL for
both the initial structure load (initial cube load, for example) and also for subsequent, regular
updates to these structures. These situations have different business requirements and result
in different package designs. We’ll tackle both of these situations by providing best practices
and recommendations based on our real-world implementations.

We’ve said this already, but remember that a primary consideration in any BI project is the
determination of resource allocation for the initial ETL for the project. For all but the smallest
projects, we’ve elected to work with ETL specialists (either internal staff or contractors) during
the initial ETL preparation and actual loading phase of the project. We often find that ETL is
50 to 75 percent of the initial project work hours. We hope that the previous three chapters
on SSIS have given you a window into the power and complexity of SSIS package develop-
ment. We do not commonly find that SQL developers or administrators have the time to
master SSAS, SSIS, and SSRS. That said, if your project is simple and your source data is rela-
tively clean, we do occasionally find that smaller teams can implement BI solutions. However,
we strongly suggest that the first approach to adding resources to your BI team is around the
initial ETL.

Another consideration regarding BI ETL is the difference between the initial load of the SSAS
objects and the subsequent regular incremental updates of these objects. Deciding to what
degree to use SSIS for the initial load is significant. We find that we use a combination of data
evaluation, cleansing, and loading techniques to perform the initial load. Generally these
techniques include some SSIS packages.


SSIS really shines for the incremental updating process, which is usually done at least once
daily after the cubes and mining models have been deployed to production. Package capa-
bilities, such as error and event handling, and package recoverability through the use of
checkpoints and transactions are invaluable tools for this business scenario. The most com-
mon production scenario we’ve used is to hire an SSIS expert to perform the heavy lifting
often needed at the beginning of project: creating the specialized packages that perform the
major data validation, cleansing, and so on. We then usually develop maintenance and daily
updating packages locally.

We’ve structured the specifics in this chapter around these topic areas: initial load of cubes,
further breakdown into loading of dimensions and fact tables, and then the initial load of
data mining structures. We’ll follow with a discussion about best SSIS design practices around
incremental updates for both cubes and mining models.

Loading OLAP Cubes


If you’ve been reading this book sequentially, you’ll remember that we strongly advised in
the section on SSAS OLAP cube design that you create an empty destination star schema
structure for each OLAP cube. We further recommended that you base this star schema on
grain statements that are directly traceable to business requirements. We assume that you’ve
followed these recommended practices and we base our recommendations for initial ETL
around this modeling core. That said, the first consideration for initial load is appropriate
assessment of source data quality. We cannot overemphasize the importance of this task.
Too often we’ve been called in to fix failed or stalled BI projects, and one root cause of these
situations is invariably a lack of attention to data-quality checking at the beginning of the
project—or skipping it altogether! In the next section we’ll share some techniques we use in
SSIS to assess data quality.

Using Integration Services to Check Data Quality


In Chapter 16, “Advanced Features in Microsoft SQL Server 2008 Integration Services,” you
learned that SSIS includes a new control flow task—the Data Profiling task—specifically
designed to help you understand the quality of data. We heartily recommend that you make
full use of this task during the data-quality evaluation phase of your project. You may be
wondering whether you should use anything else. Well, obviously you can use Execute SQL
tasks and implement SQL queries on SQL source data. However, we usually prefer to just run
the SQL queries directly in the RDBMS (or on a reporting copy) rather than taking the time to
develop SSIS packages for one-off situations. However, a number of transformations included
with SSIS, such as the Fuzzy Grouping transformation, perform operations that are not easy
to replicate directly in SQL. You
can also use some of the included SSIS transformations to get a statistical sampling of rows,
including Percentage Sampling, Row Count, and Row Sampling.
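
If you do run quality checks directly against a relational source, a handful of simple profiling
queries is often enough. The following is a minimal sketch of that approach; the table and
column names are hypothetical, and you would substitute the checks identified during your
requirements work.

-- One-off profiling queries run directly against a relational source.
-- dbo.EmployeeSource and its columns are hypothetical.
SELECT
    COUNT(*)                                               AS TotalRows,
    SUM(CASE WHEN BirthDate IS NULL THEN 1 ELSE 0 END)     AS NullBirthDates,
    COUNT(DISTINCT NationalIDNumber)                       AS DistinctNationalIDs,
    SUM(CASE WHEN HireDate > GETDATE() THEN 1 ELSE 0 END)  AS FutureHireDates
FROM dbo.EmployeeSource;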

Assessing Data Quality with the Fuzzy Grouping Transformation


An interesting approach to assessing data quality using SSIS involves the Fuzzy Grouping
transformation. This transformation is available only in the Enterprise edition of SQL Server
2008. It allows you to group together rows that have similar values. You can also use it to
group exact matches, but we tend to use the fuzzy match configuration much more than the
exact match at this phase of the project.

We’ve used this type of transformation in BI projects to help us to assess the quality of the
source data. We find it particularly useful in scenarios that include a large number of data
sources. In one situation we had 32 different data sources, some relational and some not
(such as flat file). Several of the data sources included customer information. No single source
was considered a master source of customers. Using fuzzy grouping allowed us to find pos-
sible matching data much more quickly than other types of cleansing processes, such as
scripts.

Using this transformation requires you to configure a connection manager to a SQL Server
database to house the temporary tables that will be created during the execution of this
transformation. On the Columns tab of the Fuzzy Grouping Transformation Editor (shown in
Figure 17-1), you select the columns upon which you wish to perform grouping. You can con-
figure additional input columns to be included in the output by selecting the Pass Through
option. These columns are not used for grouping purposes, but will be copied to the output.
In the bottom area, you can configure the type of grouping and other parameters related
to how the column data will be grouped. For more information, see the SQL Server Books
Online topic “Fuzzy Grouping Transformation Editor (Columns Tab).”

Figure 17-1 Fuzzy Grouping Transformation Editor, Columns tab

The configuration options on the Columns tab include several Comparison Flags, shown in
Figure 17-2. You use these flags to more granularly configure the types of data considered
to be possible matches. Note that you can set options such as Ignore Case, Ignore Character
Width, and more. This allows you to control the behavior of the fuzzy logic tool.

Figure 17-2 Comparison Flags for Fuzzy Grouping transformations

In addition, you’ll find an Advanced tab where you can change the default Similarity
Threshold value as well as configure token delimiters such as spaces, tabs, carriage returns,
and so on. The similarity threshold allows you to control the rate or amount of similarity
required for matches to be discovered.

The Fuzzy Grouping transformation performs its grouping based on the configuration, and
adds several new columns to the results table to reflect the grouping output. While the
transformation is working, the results are stored in a temporary table. After completion these
results are included in the output from the component so that they can be used in later
transformations or written to a permanent location through a destination component. You’ll
most often work with the output column named _score. Result rows with values closer to 1 in
the _score column indicate closer matches.
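
If you write the grouping output to a staging table with an OLE DB destination, a simple
query lets you review the proposed groups before acting on them. The following sketch
assumes a hypothetical staging table named stg.CustomerGroups that holds the _key_in,
_key_out, and _score output columns plus a pass-through CustomerName column.

-- Review candidate duplicates proposed by Fuzzy Grouping.
-- Rows whose _key_in differs from _key_out were grouped under another row.
SELECT _key_in, _key_out, _score, CustomerName
FROM stg.CustomerGroups
WHERE _key_in <> _key_out
  AND _score >= 0.80
ORDER BY _score DESC;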

Note The Fuzzy Grouping transformation makes significant use of temporary tables, particularly
with large data inputs; ensure that you have appropriate space allocated on the instance of SQL
Server where this transformation is configured to store its working tables.

Additional Approaches to Assessing Data Quality


Another approach you can use in assessing data quality is to quickly create a data mining
structure and then use one or more of the simpler data mining algorithms, such as Microsoft
Naïve Bayes or Decision Trees, to quickly create a couple of data mining models and get a
sense of source data quality by looking at associations, groupings, and outliers.

Remember that if you just want to get a quick visual representation of the data quality, you
can use the views available in the data source view object (table, PivotTable, chart, and so on)
to take a look at what you are working with.

We’ve also used the Fuzzy Lookup transformation to help us quickly translate source data
from disparate sources into a combined (usually interim) working data store. We then usually
perform further processing, such as validation of the lookup values discovered by the fuzzy
algorithm.

Transforming Source Data


After you’ve done some preliminary examination of source data to assess quality, the next
step is to map source data to destination locations. An important consideration at this step
is to determine whether you’ll use a staging database as an interim storage location as the
source data goes through the various cleansing and validation processes. You should have
some sense of the amount of cleansing and transformation that you need to perform after
you do a thorough data-quality evaluation.

Our experience has shown us time after time that it is well worth purchasing a dedicated
server for this staging database. If you choose to do this, you have the added benefit of
being able to execute your SSIS packages on your middle tier, rather than on any of the
source systems or on your SSAS destination system. Although this dedicated server is cer-
tainly not required, we’ve used this configuration more often than not in real-world projects.

We definitely favor creating smaller, simpler packages and storing data on disk as it flows
through the ETL pipeline, rather than creating massive, complex packages that attempt to
perfect the source data in one fell swoop. Interestingly, some of the sample SSIS packages
available for download on CodePlex use this everything-in-one-massive-source-package
design. Figure 17-3 shows a portion of the AWDWRefresh.dtsx sample package.

Figure 17-3 Part of a single, overly complex package shown in BIDS. (The pink triangle signifies BIDS Helper.)

Do you find the package difficult to read and understand when you look at it in BIDS? So
do we, and that’s our point—simple and small is preferred. Remember that the control flow
Execute Package task allows you to execute a package from inside of another package. This
type of design is often called parent and child. It’s a design pattern that we favor.

Using a Staging Server


In most situations (whether we’ve chosen to use a dedicated SSIS server or not), we create
one SSIS package per data source. We load all types of source data—flat files, Excel, XML,
relational, and so on—into a series of staging tables in a SQL Server instance. We then per-
form subsequent needed processing, such as validation, cleansing, and translations using SSIS
processes. It is important to understand that the data stored on this SQL Server instance is
used only as a pass-through for cleansing and transformation and should never be used for
end-user queries.
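
As a simple illustration, the staging table for a single flat-file customer source might look like
the following sketch. The schema, table, and column names are hypothetical; we keep staging
columns wide and nullable so that a dirty source row does not fail the load.

-- Hypothetical staging table for one flat-file customer source.
CREATE TABLE stg.CustomerFlatFile
(
    SourceCustomerID nvarchar(50)     NULL,  -- business key as delivered by the source
    CustomerName     nvarchar(200)    NULL,
    EmailAddress     nvarchar(200)    NULL,
    LoadBatchID      uniqueidentifier NOT NULL,
    LoadDateTime     datetime         NOT NULL DEFAULT (GETDATE())
);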

If you use SQL Server 2008 as a staging database, you can take advantage of several new
relational features that can help you create more efficient load and update staging processes.
The first of these is the new Transact-SQL MERGE statement. This allows what are known as
UPSERTs: INSERTs, UPDATEs, or DELETEs all performed in the same statement, depending on
the logic you write.

MERGE logic is also useful for building load packages that redirect data depending on
whether it is new. You can think of MERGE as an alternative to the built-in Slowly Changing
Dimension (SCD) transformation for those types of business scenarios. MERGE performs a
validation of existing data versus new data on load, to avoid duplicates among other issues.
MERGE uses ID values to do these comparisons. Therefore, pay careful attention to using cor-
rect (and unique) ID values in data sources that you intend to merge. The following example
of a MERGE statement is taken from SQL Server Books Online:

USE AdventureWorks;
GO
MERGE Production.ProductInventory AS pi
USING (SELECT ProductID, SUM(OrderQty) FROM Sales.SalesOrderDetail sod
JOIN Sales.SalesOrderHeader soh
ON sod.SalesOrderID = soh.SalesOrderID
AND soh.OrderDate = GETDATE()
GROUP BY ProductID) AS src (ProductID, OrderQty)
ON (pi.ProductID = src.ProductID)
WHEN MATCHED AND pi.Quantity - src.OrderQty <> 0
THEN UPDATE SET pi.Quantity = pi.Quantity - src.OrderQty
WHEN MATCHED AND pi.Quantity - src.OrderQty = 0
THEN DELETE;

For more information, see the SQL Server Books Online topics, “Using MERGE in Integration
Services Packages” and “Merge (Transact-SQL).”

Of course SSIS is built to be a useful tool to assist in data transformation, so a primary con-
sideration is when to use SSIS and when to use other measures, such as Transact-SQL scripts.
As we mentioned in Chapter 16, SSIS shines in situations where you are cleaning up messy
data and you need to trap and transform data errors and events. Key transformations that
we often use to clean up data as we prepare it for load into star schema tables include the
following: Aggregate, Cache Transform, Character Maps, Conditional Split, Copy Column,
Data Conversion, Derived Column, Export Column, Fuzzy Grouping/Lookup, Import Column,
Lookup, Merge, Merge Join, OLE DB Command, Pivot/Unpivot, Script Component, Sort, and
Term Extraction/Lookup.

One cleansing strategy that we’ve seen used is to reconcile bad source data based on con-
fidence—in other words, to perform exact matches using a Lookup transformation first in a
data flow and then process error rows using less exact transformations such as Fuzzy Lookup.
The Fuzzy Lookup transformation requires that the key field from the source (the input data-
set) and lookup (the reference dataset) be an exact match. All other columns that share the
same name(s) in the source and lookup datasets are matched automatically using a fuzzy
comparison. You can adjust the type of fuzzy match by right-clicking the black
connecting arrow on the Columns tab of the Fuzzy Lookup Transformation Editor dialog
box and then clicking Edit Relationships. The exact relationships are shown with a solid black
line; fuzzy relationships are shown with a dotted black line between the source and lookup
columns.

You can also adjust the type of fuzzy match in the Create Relationships child window that
pops up after you’ve clicked Edit Relationships. The Comparison Flags column of the Create
Relationships window selects the Ignore Case flag by default for all fuzzy matches. You can
select or clear multiple other flags in this drop-down list. Flags include the following: Ignore
Case, Ignore Kana Type, Ignore Nonspacing Characters, Ignore Character Width, Ignore
Symbols, and Sort Punctuation As Symbols.

In addition to the settings available on the Reference Table tab, you can also adjust the
thresholds for (fuzzy) comparison on the Advanced tab. You can view the various settings for
a Fuzzy Lookup transformation in Figures 17-4 and 17-5.

Another consideration for commonly used transformations during this phase of your project
concerns potential resource-intensive transformations such as Aggregate or Sort. As dis-
cussed in Chapter 14, “Architectural Components of Microsoft SQL Server 2008 Integration
Services,” these components use asynchronous outputs, which make a copy of the data in
the buffer. You should test with production levels of source data to ensure that these trans-
formations can be executed using the available memory on the server on which you plan
to execute the SSIS package. In the event of insufficient memory, these transformations will
page to disk and SSIS performance could become unacceptably slow. In some cases, you can
do this processing outside of SSIS. For example, most RDBMS are optimized for performing
sorts against large sets of data. You may choose to use a SQL statement to perform a sort in
lieu of using the Sort transformation, as long as your source data is relational.

Figure 17-4 The Reference Table tab in the Fuzzy Lookup Transformation Editor

Figure 17-5 The Advanced tab in the Fuzzy Lookup Transformation Editor

Along these same lines, you’ll often use either the Merge or Merge Join transformations
when transforming and/or loading star schema tables. You can presort the data in the stag-
ing database prior to performing the merge-type transformation to reduce overhead on your
SSIS server during package execution.

You can monitor overhead using Windows Server 2008 Performance Monitor counters from
these areas: SQL Server:Buffer Manager, SQL Server:Buffer Node, SQL Server:Buffer Partition,
and SQL Server:SSIS Pipeline 10.0. The counters whose names begin with Buffer relate to the
particular SQL Server instance; the SSIS Pipeline counters relate to the SSIS data flow engine.

Buffers Spooled is a particularly useful counter for understanding performance overhead.
This counter indicates how many buffers are being saved to disk temporarily because of
memory pressure. Figure 17-6 shows the Add Counters dialog box, in which you config-
ure what you want to monitor. This configuration is made at the operating system level.
To open the Add Counters dialog box, right-click My Computer and then click Manage.
Click Diagnostics, click Reliability And Performance, click Monitoring Tools, and then click
Performance Monitor. From there click the large green plus sign (+).

Figure 17-6 The Add Counters dialog box, where you monitor SSIS overhead
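
For the counters that belong to the SQL Server instance (the Buffer Manager, Buffer Node,
and Buffer Partition objects), you can also take a quick reading from Transact-SQL by
querying the sys.dm_os_performance_counters view, as in the following sketch. The SSIS
Pipeline counters are not exposed through this view, so you still use Performance Monitor
for those.

-- Snapshot of the instance-level buffer counters from Transact-SQL.
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%';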

Another way to improve package execution for packages that use resource-intensive trans-
formations such as Split, Lookup, or Multicast is to increase the number of cores (processors)
on your server. A performance enhancement in SQL Server 2008 SSIS is to use all available
processors. In SQL Server 2005, if your data flow had only a series of synchronous compo-
nents connected together, the data flow engine would use only a single thread of execution,
which limited it to a single logical processor. In SQL Server 2008, this has been changed so
that multiple logical processors are fully utilized, regardless of the layout of the data flow.

Data Lineage
As you begin to plan for the transformation process, your business requirements may include
the need to include data lineage, or extraction history. You may need to track lineage on fact
tables only, or you may also need to track lineage on some or all of the dimension tables. If
this is a requirement for your project, it is important that you accurately capture the require-
ments and then model your data early on in your project—in other words, from the point of
the first staging tables.

Note LineageID is an internal value that you can retrieve programmatically. For more informa-
tion, see http://blogs.msdn.com/helloworld/archive/2008/08/01/how-to-find-out-which-column-
caused-ssis-to-fail.aspx.

You can handle lineage tracking in a number of different ways, from the simplest scenario of
just capturing the package ExecutionInstanceGUID and StartTime system variables to track
the unique instance and start time of each package, to much more complex logging. The ver-
satility of SSIS allows you to accommodate nearly any lineage scenario. In particular, the Audit
transformation allows you to add columns based on a number of system variables to your
data flow. Figure 17-7 shows the dialog box that allows you to select the column metadata
that you need.

Figure 17-7 Audit transformation metadata output columns
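
A minimal sketch of what simple lineage tracking can look like at the table level follows; the
table and column names are hypothetical. The values would be supplied in the data flow,
either by the Audit transformation or by mapping the ExecutionInstanceGUID and StartTime
system variables into derived columns.

-- Hypothetical lineage columns added to a staging fact table.
ALTER TABLE stg.FactSales ADD
    LoadExecutionGUID uniqueidentifier NULL,
    LoadPackageName   nvarchar(200)    NULL,
    LoadStartTime     datetime         NULL;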

If you’re using SQL Server 2008 as a staging database, another capability you may wish to
utilize for tracking lineage is change data capture (CDC). When you enable this capability
on a database and one or more of its tables, SQL Server records details about data changes
(inserts, updates, deletes) from the transaction log into CDC tables and exposes this data
via table-valued functions. You can also use change data capture to facilitate incremental
updates to your cubes; we’ll talk more about this later in the chapter.
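
Enabling change data capture is a matter of two system stored procedure calls, as in the
following sketch. The database, schema, and table names are hypothetical, and SQL Server
Agent must be running so that the capture job can harvest changes from the transaction log.

USE StagingDB;
GO
-- Enable CDC at the database level, then for one source table.
EXEC sys.sp_cdc_enable_db;
GO
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'CustomerSource',
    @role_name     = NULL;   -- no gating role for this sketch
GO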

Tip We often use a simple mechanism that we call a data flow map to track data flow through
the planning into the transformation server setup phase of our projects. Using Microsoft Office
Excel or something similar, we track source data information such as physical location; load avail-
ability times; connection information; and database table names, column names, and data types
(for relational sources). We then map this information to staging tables, where we name the
transformations that need to be performed, and then map this data at the column level to the
star schema destination locations.

Moving to Star Schema Loading


Now that you’ve validated, extracted, transformed, and staged your source data, you are
ready to begin to load this data into your destination star schema and from there into your
OLAP cubes. If you haven’t done so already, it’s time for you to update and verify your pack-
age documentation standards. By this we mean that you should be using standardized and
meaningful names for packages, tasks, transformations, and so on. Much like commenting
.NET Framework source code, properly documenting SSIS packages is the mark of a mature
developer. In fact, when we interview developers as contractors on projects we often ask
to review a couple of their past packages to get a sense of their method (and discipline) of
working in the SSIS world. In this section we again assume that you are working with star
schemas that have been modeled according to the best practices discussed in earlier sec-
tions of this book. To that end, when you load your star schemas you will first load dimension
tables. After successful dimension table load, you will proceed to load fact tables. We’ve seen
various approaches to these two tasks. To be consistent with our “smaller is better” theme, we
favor creating more, simpler packages—in this case at least one per dimension table destina-
tion and one per fact table. Although you could put all of the logic to load a star schema in
one package—even to the point of loading the OLAP cube from the star schema—we advo-
cate against huge transactional packages, mostly because of complexity, which we find leads
to subtle bugs, maintenance challenges, and so on.

Loading Dimension Tables


Generally, preparing source data for load into dimension tables is the larger of the two types
of initial load ETL tasks when building OLAP cubes. This is because the original source data
often originates from multiple source tables and is usually destined for very few destination
tables, and possibly only one. We’ll illustrate this with a database diagram (Figure 17-8) show-
ing the potential source tables for an Employee dimension that could be based on original
AdventureWorks (OLTP) source tables. Here we show six source tables. (Remember that this
example is oversimplified.)

Figure 17-8 Six source tables related to employee information

We have seen dimensions with 10 to 20 data sources in many projects. Often these data
sources are not all relational (flat file, Excel, and so on). Also, we find that dimensional source
data is often dirtier than fact table data. If you think of dimension source tables around
Employees and compare them with Sales data source tables, you’ll probably understand
why it is not uncommon to find much more invalid data (because of typing errors, NULL values,
and failing to match patterns or be within valid ranges) in dimension source tables than in
fact tables. Businesses tend to make sure that their financials (such as sales amount and sales
quantity) are correct and sometimes have less-rigid data-quality requirements regarding
other data.

When you load star schemas that you have modeled according to the guidelines that we
introduced in earlier chapters, you always load dimension tables before attempting to load
related fact tables. We’ll remind you of the reason for this: We favor building dimension
tables with newly generated unique key values—values that are generated on load of the
table data. These new values become the primary keys for the dimension table data because
these new keys are guaranteed to be unique, unlike the original keys, which could originate
from multiple source systems and have overlapping values. These new keys are then used as
the foreign key values for the fact table rows. Therefore, the dimensions must first be suc-
cessfully loaded and then the newly generated dimension primary keys must be retrieved via
query and loaded into the related fact tables.
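
The following sketch shows this pattern in Transact-SQL with hypothetical table and column
names: the dimension generates its own surrogate key on load, and the fact load then
retrieves that key by joining on the original business key.

-- Dimension table with a generated surrogate key.
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey      int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    SourceEmployeeID int               NOT NULL,              -- original business key
    EmployeeName     nvarchar(100)     NOT NULL
);

-- The fact load resolves the surrogate key from the already-loaded dimension.
INSERT INTO dbo.FactSales (EmployeeKey, OrderDateKey, SalesAmount)
SELECT d.EmployeeKey, s.OrderDateKey, s.SalesAmount
FROM stg.Sales AS s
JOIN dbo.DimEmployee AS d
    ON d.SourceEmployeeID = s.SourceEmployeeID;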

Loading Fact Tables


It is common to use a fast load technique to quickly load initial data into fact tables, particu-
larly if you’ve used staging tables as a temporary holding area for cleansed data. Fast load is
an option available on some of the data destination components, such as the OLE DB des-
tination. The SQL Server destination uses bulk load with a shared memory provider behind
the scenes and is often faster than the OLE DB destination with fast load enabled. So you can
think of using the SQL Server destination as an alternative to using an OLE DB destina-
tion. However, the shared memory provider that gives the SQL Server destination its speed
also means that you must execute packages that use it on the destination SQL Server. If you
need to run packages on a dedicated ETL server, the OLE DB destination is a better choice.

As its name indicates, fast load is a more efficient method of flowing data from the pipeline
into a named destination. To use fast load, you must set several configuration options. Figure
17-9 shows the OLE DB Destination Editor dialog box where you configure the various desti-
nation load options, including fast load.

Figure 17-9 Fast load is an efficient mechanism for loading fact tables.

Another concern regarding fact table ETL is that you must refrain from adding extraneous
columns to any fact table. “Requirement creep” often occurs at this phase of the project, with
clients asking for this or that to be added. Be diligent in matching fact table columns to busi-
ness requirements. Fact tables can run to the tens or hundreds of millions or even billions of
rows. Adding even one column can impact storage space needed (and ETL processing over-
head) significantly.

Of course, fast load is not just for fact tables. You can use it for dimension loads just as easily
and with the same performance benefits. In our real-world experience, however, fact tables
generally contain significantly more data than dimension tables, so we tend to use fast load
primarily when working with fact tables.
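
When the cleansed data is already staged on the destination instance, a set-based insert with
a TABLOCK hint is a reasonable alternative to pushing the data back through a package, as in
the following sketch. Whether the insert is minimally logged also depends on the recovery
model and on the target table's indexes, so test with production volumes of data. Table
names are hypothetical.

-- Bulk-style load from a staging table into a fact table on the same instance.
INSERT INTO dbo.FactSales WITH (TABLOCK)
    (EmployeeKey, OrderDateKey, SalesAmount)
SELECT EmployeeKey, OrderDateKey, SalesAmount
FROM stg.FactSalesLoad;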

To better understand using SSIS to load dimension and fact tables, we’ll take a closer look
at one of the sample SSIS packages that are part of the group available for download from
CodePlex (as described in Chapter 15, “Creating Microsoft SQL Server 2008 Integration
Services Packages with Business Intelligence Development Studio”). Open the sample folder
named LookupSample. Notice that unlike the other samples in this group, this folder contains
only the package and does not contain the .sln file. To open this package in BIDS, create a
new project of type SSIS package and then add an existing item by right-clicking the SSIS
packages folder in the Solution Explorer window to add the existing sample SSIS package
named LookupSample.dtsx.

Notice that the control flow for this package contains two Data Flow tasks named DFT Load
Lookup Cache and DFT Load Fact Table. If you click the Data Flow tab of the BIDS designer
and then select the DFT Load Lookup Cache Data Flow task, you’ll see that it contains the
three components shown in Figure 17-10: a flat file source named DimTime Source, a Derived
Column transformation named Derived Column, and a Cache Transform transformation
named Cache Transform.

Figure 17-10 The DFT Load Lookup Cache data flow



Further examination of the three items in the data flow in Figure 17-10 reveals the following
activities:

■■ Source time data dimension members are loaded from a flat file.
■■ A new unique key is created by deriving two new columns—one for the ID and one for
the name—each by using an expression in the Derived Column transformation.
■■ A cache object is populated with this source data for subsequent use in the package.

Next we’ll look in more detail at the second data flow, which is named DFT Load Fact Table
and shown in Figure 17-11. It contains a flat file data flow source, a Lookup transformation
that uses the Cache object created in the previous data flow, and three different branches
depending on whether a match was found in the lookup cache. It also includes error output
components, which are indicated with a red arrow and the text Lookup Error Output.

Figure 17-11 The DFT Load Fact Table data flow

The data flow tasks in the LookupSample.dtsx package are common design patterns for
dimension and fact table load. They are simple, which is what we prefer. We sometimes also
add more sophisticated error handling in situations where source data is dirtier (which we
may have discovered earlier in the project by using techniques such as data profiling).

After you’ve successfully loaded dimension and fact tables with the initial data load, you’ll
turn your attention to regular update ETL. The most common case that we implement is a
once-a-day update, usually done at night. However, some businesses need to update more
frequently. The most frequent updating we’ve implemented is hourly, but we are aware of
near-real-time BI projects. If that is your requirement, you may wish to reread the section in
Chapter 9, “Processing Cubes and Dimensions,” on configuring OLAP cube proactive cach-
ing. SSIS supports all updating that is built into SSAS. We reviewed storage and aggregation
options in Chapter 9.

Updates
Remember the definition of ETL for OLAP cubes: For cubes (fact tables) adding new facts
(rows to the fact table) is considered an update, and the cube will remain online and available
for end-user queries while the new rows and aggregations (if any have been defined on the
cube partition) are being added to the cube. If cube data (fact table row values) needs to be
changed or deleted, that is a different type of operation.

As we saw in Chapter 6, “Understanding SSAS in SSMS and SQL Server Profiler,” SSIS includes
a control flow task, the Analysis Services Processing task. This task incorporates all possible
types of cube (measure group or even individual partition) and dimension update options.
Figure 17-12 shows the task’s main configuration dialog box.

Figure 17-12 SSIS supports all types of measure group and dimension processing.

As mentioned earlier, in the section of this chapter on staging servers, if you are using a stag-
ing database built on SQL Server 2008, you can use the new change data capture feature in
conjunction with SSIS packages to facilitate the updating of your fact and dimension tables.

From a mechanical standpoint, to configure the SSIS task to perform the appropriate action
(insert, update, or delete), you'll query the __$operation column from the change data capture
log table. This column records activity on the table(s) enabled for change data capture from
the transaction log using the following values: 1 for deleted data, 2 for inserted data, 3 and
4 for updated data (3 is the value before the update and 4 is the value after the update), or
5 for merged data. Figure 17-13 (from SQL Server Books Online) illustrates integrating CDC
with SSIS from a conceptual level. You then use those queried values as variables in your SSIS
package to base your action on—for example, if 2 (for inserted) for source dimension data,
process as a Process Update type of update in an SSIS Analysis Services control flow process-
ing task for that dimension; if 1, process as a Process Full type of update, and so on.
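
The following sketch shows the kind of query an Execute SQL task might issue to pick up
those values; the capture instance name (dbo_CustomerSource) and the columns other than
the CDC metadata columns are hypothetical.

-- Read all changes captured for the dbo.CustomerSource table.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_CustomerSource');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT __$operation, __$start_lsn, SourceCustomerID, CustomerName
FROM cdc.fn_cdc_get_all_changes_dbo_CustomerSource(@from_lsn, @to_lsn, N'all');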

Figure 17-13 CDC process in SQL Server 2008 (OLTP source tables feed the transaction log; the capture process populates the change tables, which are exposed through query functions to the data warehouse extraction, transformation, and loading process)



Fact Table Updates


We often find much confusion and unnecessary complexity in SSIS packages resulting from a
lack of understanding of what types of changes constitute updates. You must, of course, fol-
low your business requirements. The most typical case is that it is either an error or an excep-
tion to change or delete facts.

So you first must verify exactly how these types of updates are to be handled. It is important
to make stakeholders aware that changing or deleting facts will result in the need to fully
reprocess all fact data in that particular partition, or in the entire cube (if the cube contains
only a single partition). Knowing that, we’ve often found that an acceptable solution is to cre-
ate some sort of exception table, or to use writeback, particularly now that SQL Server 2008
supports writeback in MOLAP partitions.

As with initial fact table loads, we recommend that you use fast load to perform incremen-
tal updates to fact tables, as long as you’ve already validated and cleansed source data and
stored it in an intermediate staging area. Fast load of new data values to the staging table or
tables, which is then pushed (or updated) to the SSAS cube for incremental updates, is ideal
because it causes minimal disruption in cube availability for end users.

Cube Partitioning
As we mentioned in Chapter 9, appropriate cube partitioning is important for efficient
maintenance. The Analysis Services Processing task that you use to perform updates to
an SSAS cube “understands” partitions, so if you’ve chosen to use them, you can incor-
porate partitioned updates into your SSIS strategy. From a practical perspective, smaller
cubes—those with less than 1,000,000,000 fact table rows—will probably have accept-
able processing and query times with a single partition. However, when cubes grow,
partitioning is often needed to reduce processing times.

Dimension Table Updates


Complexity around dimension table updating is a common bottleneck in SSAS projects.
Remember that, like fact table rows, adding new rows to a dimension table is considered
an incremental update. Also, as with fact table rows, changing or deleting dimension table
rows can result in the need to fully reprocess the dimension and any cubes that reference
that particular dimension. To this end, modeling dimension tables to support update busi-
ness requirements is quite important. Remember the four types of update (change/delete)
behaviors:

■■ No Changes Allowed: Any request to change data is treated as an error.
■■ Overwrite: The last change overwrites the previous value, and the history is lost.
■■ Keep History As Row Versions: Write a new row in the dimension table for history, add a
date/time stamp, mark the newest value as active, mark older values as inactive, and mark
deletes as inactive.
■■ Keep History As Column Versions: Write original column values to a history column. Use
this kind of update only if you expect to store a limited amount of history for a small
number of columns.

SSIS contains the Slowly Changing Dimension transformation, which is a component that sup-
ports simplified creation of ETL around some of these dimension updating requirements. We
looked at this component in Chapter 5, “Logical OLAP Design Concepts for Architects.” We
like to use this transformation in simple scenarios because it’s easy, fast, and somewhat flex-
ible via the configuration built into the wizard you run when you set up the transformation.
However, the limitations of this transformation are important to understand. Most are related
to scalability. Lookup tables associated with the SCD transformation are not cached, and
all updates are row-based, which means that locking can also become an issue. In higher-
volume dimension update scenarios, we prefer to use standard Lookup transformations or,
if using staging locations, direct fact loads (as described earlier in the section "Loading Fact
Tables”).

ETL for Data Mining


Preparing your source data for load into data mining models involves some of the same
considerations that you had with data destined for OLAP cubes. Specifically, you’ll want to
quality check as best you can. We find two types of business scenarios when loading data
mining structures. In some cases, a client prefers to create OLAP cubes first and then use
the cleansed data in the star schema or in the cubes themselves as source data for new data
mining models. This approach reduces the need for complex ETL specific to the data mining
load, because the source data has already been cleansed during the OLAP cube prepara-
tion process. However, sometimes we encounter the opposite situation: A client has a huge
amount of data that contains potentially large amounts of dirty data and therefore prefers to
start with data mining rather than OLAP cubes because the messiness of the data may render
it nearly unusable for OLAP queries. We alluded to this latter scenario earlier in this chapter
when we briefly discussed using data mining as a quality check mechanism.

Initial Loading
Regardless of whether you start with relational or multidimensional data as source data, you’ll
still want to use SSIS to perform quality checks on this source data. If your source data origi-
nates from OLAP cubes, your quality check process can be quite abbreviated. Quality checks
can include general data sampling to validate that data is within expected values or ranges,
data type verification, data length verification, and any others that you identified during your
requirements phases and initial data discovery.

After you complete whatever checking you wish to do, you need to prepare your source data
for initial load into your mining structures and models. We covered data requirements by
algorithm type fairly completely in Chapter 12, “Understanding Data Mining Structures,” and
you may want to review that information as you begin to create the ETL processes for load.
Remember that various algorithms have requirements around uniqueness (key columns), data
types (for example, time series requires a temporal data type column as input), and usage.
Usage attributes include input, predictable, and so on. You can use profiling techniques to
identify candidate columns for usage types. You can use the included data mining algorithms
as part of your SSIS package, using the Data Mining Query task from the control flow or the
Data Mining Query transformation from the data flow. You might also choose to use the dedicated
Data Profiling task from the control flow.

Remember also that model validation is a normal portion of initial model load. You’ll want
to consider which validation techniques you’ll use, such as lift chart, profit chart, or cross-
validation, for your particular project.

Tip A new feature in SQL Server 2008 allows you to create more compact mining models. We
covered this in Chapter 13, “Implementing Data Mining Structures,” and you should remember
that you can now drill through to any column in the data mining structure. Your models can be
more efficient to process and query because they will include only the columns needed to pro-
cess the model.

Another consideration to keep in mind is that the initial model creation process is often much
more iterative than that of loading OLAP cubes. Mining model algorithms are implemented,
validated, and then tuned by adding or removing input columns and/or filters. Because of
this iteration involved in data mining, we tend to use SSIS packages less during initial load-
ing and more after initial models have been created. We then use SSIS packages to automate
ongoing model training and also sometimes to automate regularly occurring DMX queries.

Model Training
Although you will use SSIS packages to automate regular model training, you’ll probably use
SSIS packages more frequently to automate regular predictive queries using new inputs. We’ll
discuss the latter in the next section.

To automate ongoing, additional training of data mining models, you can use the Data
Mining Model Training destination from the Toolbox in SSIS. To use this component, you
must provide it with an input that includes at least one column that can be mapped to
each destination column in the selected data mining model. You can view this mapping in
Figure 17-14.

Note Model overtraining is a general concern in the world of data mining. Broadly, overtrain-
ing means that too much source data can distort or render inaccurate the resultant newly trained
mining model. If your business scenario calls for progressive retraining with additional data, it
is very important to perform regular model validation along with model retraining. You can do
this by creating an SSIS package that includes an Analysis Services Execute DDL task. This type of
task allows you to execute XMLA against the mining model. You can use this to programmatically
call the various validation methodologies and then execute subsequent package logic based on
threshold results. For example, if new training results in a model with a lower validation score,
roll back to a previous model. You would, of course, need to include SSIS task transaction logic in
your package for this technique to be effective.

Figure 17-14 Data Mining Model Training Editor, Columns tab

Although you may use SSIS packages to automate regular mining model training, we have
found a more common business scenario to be automating regularly occurring DMX queries.
We’ll cover that scenario next.

Data Mining Queries


If you would like to include DMX query execution in an SSIS package, you have two options
to select from: the Data Mining Query control flow task or the Data Mining Query transfor-
mation component.

The Data Mining Query Task Editor allows you to incorporate a DMX query into an SSIS
package control flow. To configure this task, you’ll work with three main tabs: Mining Model,
Query, and Output. On the Mining Model tab you configure your connection to an SSAS
instance and then select the mining structure and mining model of interest. Next, you’ll use
the Build Query button on the Query tab to create the DMX query. Just as when you work in
BIDS or SSMS to create DMX queries, the Build Query tab opens first to the visual DMX query
builder. The tab also contains an advanced view so that you can directly type a query into the
window if desired.

As you continue to work on the Query tab, you’ll notice that you can use the Parameter
Mapping tab to map package variables to parameters in the query. On the Result Set tab,
you can choose to store the results of the query to an SSIS variable. Finally, you can use the
Output tab to configure an external destination for the task’s output.

If you stored the results of the query in a variable or an external destination, you can use
those results from other tasks in the control flow. Figure 17-15 shows the Data Mining Query
Task Editor dialog box.

Figure 17-15 The Data Mining Query Task Editor

Next, let’s review the workings of the Data Mining Query transformation. This transformation
allows you to take data columns from an existing data flow as input to a data mining query
(such as a prediction query). Working with this component is similar to working with the
Data Mining Query task in that you need to select an instance of SSAS to connect to along
with the mining structure and model you want to use. You will also create a DMX query in
the configuration dialog box that is part of this component. The key difference between the
task and the transformation is shown in Figure 17-16, which shows that the Input Columns
originate from the particular section of the data flow that has been connected to this
transformation.

Figure 17-16 Configuring Input Columns for the Data Mining Query transformation

Now that you’ve seen the mechanics of both of these components you may be wondering
about possible business scenarios for their use.

One scenario we’ve employed is using the Data Mining Query transformation to predict the
likelihood of a new row being valid. We’ve done this by running a DMX prediction query
against a model of valid data.

Another interesting implementation revolves around incorporating predictive logic into a
custom application. At http://www.sqlserverdatamining.com, you can find an example of
doing just this in the sample Movie Click application. However, in this particular sample, the
approach taken was direct access to the data mining API, rather than calling an SSIS package.

Summary
In this chapter we focused on high-level guidance and best practices rather than mechanics
of implementation for ETL specific to BI projects. We discussed tips for data evaluation and
went on to take a look at guidance around OLAP cube initial load and incremental updates.
We differentiated between fact and dimension table loading, followed by a look at loading,
updating, and querying data mining models.

We have still more to cover: In Chapter 18, “Deploying and Managing Solutions in Microsoft
SQL Server 2008 Integration Services,” we’re going to look first at ETL package configurations,
maintenance, and deployment best practices. In Chapter 19, “Extending and Integrating SQL
Server 2008 Integration Services,” we’ll talk about using the script components to program-
matically extend the functionality of SSIS.
Chapter 18
Deploying and Managing Solutions
in Microsoft SQL Server 2008
Integration Services
After you finish developing your packages, you have to move them to the production server,
where they are typically executed on a scheduled basis using SQL Server Agent Job Steps or
by direct invocation through the DTExec tool. In any case, you have to make sure that your
packages will be executed correctly, taking into consideration that the production server
might have different resources than your development environment (for example, different
drive letters) and also that the security context under which packages will be executed might
be completely different from the one that you used in development. In this chapter, we pres-
ent best practices for these processes. We also consider the need to protect your work from
mistakes or changing business requirements, using a complete infrastructure to manage dif-
ferent versions of packages.

Solution and Project Structures in Integration Services


As you’ve seen already, Business Intelligence Development Studio (BIDS) organizes files into
groupings of solutions and projects. A SQL Server 2008 Integration Services (SSIS) project
contains all the files needed to create at least one specific extract, transform, and load (ETL)
package. By default, an Integration Services project stores and groups the files that are
related to the package, such as *.dtsx and more.

A solution is a container of one or more projects; it’s useful when you’re working on big
projects that are better handled when broken down into smaller pieces. Keep in mind also
that a solution can contain projects of different types. For a BI solution, you can create a
single solution where you’ll be able to store the data warehouse database project, the ETL
project (packages) developed with SSIS, the OLAP structures created with SQL Server Analysis
Services (SSAS, with its own XMLA, MDX, or DMX script files), and also the reporting project
(RDL files) developed with SQL Server Reporting Services (SSRS). Although it’s possible to use
a single-solution file with multiple project groupings for your complete BI solution, in prac-
tice most projects are too complex for this type of storage to be practical.

As a best practice, we prefer to have at least one solution for each technology we choose to
use—for example, one solution for SSIS, one for SSAS, and one for SSRS. Also, we typically
create a unique solution container for each core business requirement. For example, a solu-
tion that performs the initial load ETL process to populate a star schema from a core source
database would be separate from an SSIS solution that will handle ETL processes to integrate
data with external companies. As with development in general, failure to use adequate code
organization makes a BI solution difficult to manage and deploy.

As we mentioned earlier, you should keep your organization of packages as simple as is prac-
tical. If you are in doubt, favor separating business processes, such as loading each individual
dimension (or fact) table, into individual SSIS packages.

Source Code Control


Development of solutions based on SSIS might require you to write a lot of packages, and
you might also find yourself working in teams with other people. In any professional devel-
opment environment, the ability to avoid losing even a single line of code—in this case,
XML code, because packages are really XML files—should be guaranteed.

In addition, the development environment should allow you to work on the same solution
that your other colleagues are working on, without creating overlapping changes that might
cause you to lose your work. This can happen, for example, if you and a colleague work on
the same package simultaneously. The last person to save the package will overwrite the
other’s changes.

As a last point, the development environment should enable you to recover old versions of
packages, such as a package developed one or more months ago. You might need to do this
because a feature that has been removed from your ETL logic is needed again or because
changes to a package saved earlier have broken some ETL logic and the earlier version needs
to be recovered. Generally, in your working environment you should have a central reposi-
tory of code that acts as a vault where code can be stored safely, keeping all the versions
you’ve created, and from where it can be reclaimed when needed.

You need a source code control system. One of the Microsoft tools you can use to implement
this solution is Visual SourceSafe (VSS). VSS is designed to directly integrate with the Visual
Studio (and BIDS) integrated development environments. Another option is Team Foundation
Server, which includes version control and other features, such as work item management
and defect tracking. After you have your source code control environment set up, you and
your team will store your solution files in a central code repository managed by that source
code control system.

Tip Even though we’re focusing on using VSS as a source control mechanism for SSIS package
files—that is, *.dtsx files—it can also be used for other types of files that you’ll be developing in a
BI project. They can include XMLA, MDX, DMX, and more.

Using Visual SourceSafe


If you decide to use VSS, you need to verify that after the installation a Visual SourceSafe
database is available for use. To do that, you can just start the Visual SourceSafe client from
the Start menu. If no Visual SourceSafe database has been created, a wizard is available to
help you create one: the Add SourceSafe Database Wizard.

A database for VSS is actually just a specific folder in the file system. To create a new data-
base for VSS, you open the wizard by using the VSS Administrator (or Explorer) and accessing
the Open SourceSafe Database menu option and then clicking Add. On the next page of the
wizard, if you’re creating a new VSS database, enter the path and folder name where you’d
like the new VSS database to be created. It’s a best practice to put the VSS database in a
network share so that all the people on your BI development team will be able to connect to
that database from the shared network location.

After you create the new VSS database, you need to specify the name of that database. After
you complete this step and click Next, the Team Version Control Model page appears, which
allows you to define how Visual SourceSafe will work, as shown in Figure 18-1. You can select
the Lock-Modify-Unlock Model option or the Copy-Modify-Merge Model option. We recom-
mend that you select the Lock-Modify-Unlock Model option.

Figure 18-1 Team Version Control Model page of the Add SourceSafe Database Wizard

If your VSS database is configured to use the Lock-Modify-Unlock model, only one person
at a time can work on a package, which helps you to avoid any kind of conflict. The Copy-
Modify-Merge model allows more than one person to work on the same file, allowing people
to resolve conflicts that might happen if and when they modify the same line of code. We
feel that the latter model is more appropriate for traditional application development rather
than for SSIS package development. The reason for this is that although SSIS packages can
contain large amounts of functionality, you should favor creating smaller, more compact
packages rather than large and complex ones (as we’ve mentioned elsewhere in this book).
Of course, there might be exceptions to this general guideline; however, we’ll reiterate our
general development guideline: keep it as simple as is practical. We’ve seen many SSIS pack-
ages that were overly complex, probably because the developer was working based on her
experience with traditional application (that is, class) development.

Tip Merging SSIS package versions using VSS’s merge capability doesn’t work very well with
the SSIS XML structure, because it’s easy for even minor layout changes in the designer to result
in major differences between the versions. This is another reason we prefer the Lock-Modify-
Unlock model.

After you’ve created and configured the VSS database, you also have to create the appro-
priate number of VSS user accounts for you and anyone else on your BI development team
who’ll need to access the source code. VSS doesn’t use Windows Authentication, so you’ll
have to create the accounts manually or through the VSS application programming inter-
face (API). You can administer user accounts and their account permissions with the Visual
SourceSafe Administrator tool (shown in Figure 18-2), which you can find under the Microsoft
Visual SourceSafe, Microsoft Visual SourceSafe Administration menu item available from the
Start menu.

Figure 18-2 Managing VSS users with the Visual SourceSafe Administrator tool

After you’ve completed these setup steps, you can start to use VSS from BIDS. If you’re cre-
ating a new solution, you can simply tell BIDS that you want to store solution files in Visual
SourceSafe right from the beginning. To do this, just select the Add To Source Control option
in the New Project dialog box.

Alternatively, if you have an existing solution that you want to add to the source code control
system, you can do it by simply selecting the Add Solution To Source Control item on the
solution’s shortcut menu from the Solution Explorer window as shown in Figure 18-3.

Figure 18-3 The solution's shortcut menu after having installed a source code control system

If Add Solution To Source Control doesn’t appear on the shortcut menu after you right-click
on the solution name in the Solution Explorer window, you might need to configure BIDS to
work with the installed VSS system manually. You can do that through the Options dialog box
in BIDS as shown in Figure 18-4.

Figure 18-4 You can configure which source code control system to use from the BIDS Options dialog box.

After you select the Add Solution To Source Control item, you’re asked to authenticate your-
self using Visual SourceSafe credentials, which include your VSS user name, VSS password,
and the VSS database name.

After doing that, you need to specify where you want to save the project inside the VSS data-
base. As we mentioned, that database is very similar to a file system, so you have to provide
the path for your project. This path will be used inside the VSS database to separate your
project files from others’ project files. The folder location is preceded in the dialog box by a
dollar sign ($). This is just the naming convention used by VSS.

After clicking OK, your project will be stored in the location you’ve specified in the VSS data-
base. From here on, all the project files visible in the Solution Explorer window will have a
little icon on their left that indicates their status with respect to VSS. For example, the small
lock icon shown in Figure 18-5 next to the SSIS package file named ProcessXMLData.dtsx
indicates that the file is not being edited by anyone and it’s safely stored on the server where
the VSS database resides.

Figure 18-5 Element status icons are visible when a solution is under source control.

A red check mark indicates that you’re currently editing the file, and no one else will be able
to edit it. This status is called Checked Out. VSS requires that you check out files before you
can start to edit them. You can check out a file by right-clicking the file name in Solution
Explorer and then clicking Check Out For Edit, as shown in Figure 18-6.

Figure 18-6 The specific source code control menu items available on the shortcut menu

After you successfully check out a file, the source code control system makes a local copy of
that file on your computer—by default in your project folder, which is typically placed inside
My Documents\Visual Studio 2008\Projects. All the changes you make to that file remain on
your computer only, until you’re satisfied with your work and you want to save it back into
the central database, which will create a new version and make the updated version available
to the rest of your BI development team.

The operation that updates the source code control database is called Check In and is avail-
able on the shortcut menu of the checked-out file, as shown in Figure 18-7.

Figure 18-7 The solution's shortcut menu after having checked out a file

As mentioned, the Check In operation copies your local file to the source code control data-
base, but it also keeps a copy of the original file. In this way, you can have all the versions
of the package always available if you ever need to restore a previous version of that file.
One method to revert to a previous version of a file is by using the View History item, which
is available from the shortcut menu of the file for which you want to restore an old version.
Choosing the View History item opens the History dialog box shown in Figure 18-8. You can
select (Get) any previous version from this dialog box. Not only can you restore a file to a
previous version, but you can also perform any of the operations shown in Figure 18-8—for
example, Diff (which allows you to compare two versions).

As you can see, there is a lot of functionality available in VSS. Also, you can do many more
things with a source code control system than the activities introduced in this chapter, which
aims to give you just an operational overview of such a system. If we’ve piqued your interest
in using a source code control system, you can learn more by reading the help files of the
source code control system you’ve selected. MSDN contains some useful basic documenta-
tion on VSS at http://msdn.microsoft.com/en-us/library/ms181038(VS.80).aspx.

Figure 18-8 The VSS History dialog box, where you can view and get all previous versions of a selected file

Note Other source control systems, such as Visual Studio Team System (VSTS) and Team
Foundation Server, are available. Although a discussion of the functionality of VSTS is outside
the scope of this book, we encourage you to explore whatever source control systems you have
available in your particular development environment. At a minimum, you and your team must
select one common system and use it. There are too many parts and pieces (literally files) in a
BI project to consider proceeding without some kind of standardized source control. For an
introduction to VSTS, see the MSDN documentation at http://msdn.microsoft.com/en-us/library/
fda2bad5.aspx.

The Deployment Challenge


After you’ve developed, debugged, and tested your ETL SSIS packages, you will want to
deploy your packages to production or to preproduction servers—the latter if your project is
being implemented in a more complex (enterprise) environment.

Note If you used SQL Server 2000 Data Transformation Services (DTS), SQL Server 2008 SSIS
deployment will be a new process for you. With SQL Server 2000 DTS, packages were usu-
ally created on the same server where they would be executed, thus eliminating the need to
deploy them. Of course, you might have used a server for development and another server for
production and therefore had to move DTS packages from one to the other when you finished
development. However, the concept with SQL Server 2000 is that packages were directly saved
on the server where they were developed. To do that, you used a specific application—the DTS
Designer—to create those packages.

Starting with SQL Server 2005, package development moved from Enterprise Manager to
BIDS, and the deployment mechanics changed as well. In this section, we'll cover deployment options
available for SSIS 2008 packages.

All package development is initiated from the local developer’s machine, where you have
installed BIDS. All the packages in your project are initially saved as files with the .dtsx exten-
sion on your local machine (or to a shared location if you’re using VSS). So, at the end of
development, you have to move (deploy) these packages to the target production or prepro-
duction server or servers so that they can be scheduled and executed through the SSIS run-
time without the need to run them from BIDS.

You know from reading the last chapter that there are two fundamentally different types of
ETL processes associated with BI solutions: the initial load of structures such as cubes and
mining models, and the incremental update of those structures. Depending on the scope of the project, you'll sometimes
find yourself running the initial load ETL package directly from your developer machine. For
larger projects, this won’t be the case. Invariably, the follow-on incremental update ETL pack-
ages for both the dimension and fact tables will be run from a production server and will be
regularly scheduled by a database administrator.

How does the deployment process of a package work? Because you’re handling files you
want to deploy, everything is pretty straightforward. However, before starting the deploy-
ment you first need to decide where on the target server you want to deploy your pack-
ages. There are three possible locations on the server where you can save your packages, as
described in the following list:

■■ File deployment Package is stored in an available directory on the server’s file system.
■■ Package Store MSDB deployment Package is stored inside the sysssispackages table
in the SQL Server msdb database.
■■ Package Store File deployment Package is stored on the server’s file system
in a Package Store configured path, typically in %Program Files%\Microsoft SQL
Server\100\DTS\Packages.

Although the difference between msdb deployment and the other two options should be
clear, you might be wondering what kind of difference exists between file deployment and
Package Store File deployment, because they both store the package on the file system.
While file deployment basically allows you to use any directory available in the file system,
Package Store File deployment manages and monitors the %Program Files%\Microsoft SQL
Server\100\DTS\Packages folder so that you can see and manage its content right from SQL
Server Management Studio (SSMS). The Package Store is handled by the SSIS service, which
implements monitoring and cataloging functionalities. This allows you to keep an ordered
and clear catalog of available packages. We’ll talk about the Integration Services service fea-
tures again later in this chapter.

So how do you decide which location to use? There are two main points that you have to
consider to answer that question.

The first is simplicity of deployment and management. Using a file system as a storage loca-
tion allows you to deploy your packages by just copying them to the target folder, while the
usage of the msdb database requires you to deploy the packages by using the dtutil.exe tool,
importing them through SSMS, using a custom deployment utility that leverages the SSIS API,
or using BIDS to save a copy of your package in the msdb database. Also, the management of
package files is easier with a file system because you’re simply dealing with files. For example,
the backup of all the packages stored simply requires copying them to a safe location. With
this solution, you can also easily decide to back up only some packages and not all of them.
This is also possible with msdb, but it's not as straightforward. You must use the dtutil.exe
tool or SSMS to export the packages you want to back up so that you can have them stored
in a .dtsx file, and then move that file to a secure place. However, backing up all the packages
in msdb is as easy as backing up the msdb database and might already be covered under
your existing SQL Server backup strategy.
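
For example, the dtutil.exe command-line tool can handle both directions. The following sketch
assumes a package file named ProcessXMLData.dtsx, a server named MYSERVER, and folder paths
that you would replace with your own:

rem Deploy a package file into the msdb database on MYSERVER
dtutil /FILE "C:\SSISPackages\ProcessXMLData.dtsx" /DestServer MYSERVER /COPY SQL;ProcessXMLData

rem Export the same package from msdb back to a .dtsx file as a backup
dtutil /SQL ProcessXMLData /SourceServer MYSERVER /COPY FILE;"C:\SSISBackup\ProcessXMLData.dtsx"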

On the other hand, you should also consider the security of package information. When a
package is stored in a file system, you might need to encrypt some values, such as passwords,
so that they cannot be stolen. This type of problem doesn’t exist with a package stored in
msdb because access to any part of the package is managed by SQL Server directly and
thus access is granted only to those who have the correct permission defined. There is more
complexity to properly securing SSIS packages, and we’ll discuss this later in the chapter (in
“Introduction to SSIS Package Security”), when we discuss security configuration options in
the context of the ways you can manage and execute packages.

When deploying SSIS packages, you also have to consider package dependencies, such as
files, file paths, connection strings, and so on. To avoid encountering errors when packages
are executed on production servers, you need to be able to dynamically configure the loca-
tions of such resources without having to directly modify any package properties using BIDS.
BIDS might not even be installed on production servers. Also, modifying packages directly
after development might introduce bugs and other issues. So, what you really need is a sepa-
rate location where you can store all the values that might have to change when you deploy
the package to the server. This way, when the package is executed, the values will be taken
from this defined, external configuration. SSIS offers this feature, which is called Package
Configurations.

Package Configurations
In BIDS, you can enable a package to use an external configuration file by selecting the
Package Configurations item from the SSIS menu. You'll see the Package Configurations
Organizer dialog box shown in Figure 18-9. Note that you can associate more than one
external configuration with a single SSIS package.

Figure 18-9 Package Configurations Organizer dialog box

The Enable Package Configurations check box allows you to specify whether the package
will try to load the specified package configurations. Package configurations can be created
by clicking the Add button, which starts the Package Configuration Wizard. On the Select
Configuration Type page (shown in Figure 18-10), you have to select where you want to store
the package configuration data.

Figure 18-10 Select Configuration Type page of the Package Configuration Wizard

The Configuration Type drop-down list allows you to choose from the following five types of
storage locations for variable values:

■■ XML Configuration File Stores object property values using an XML file.
■■ Environment Variable Allows you to use an environment variable to save the value of
an object property.
■■ Registry Entry Stores an object property value into a registry key.
■■ Parent Package Variable This option is useful when you have a package that can be
executed from another package. It allows you to bind an object property value to the
value of a variable present in the calling package.
■■ SQL Server Stores object property values using a SQL Server database table.

The most common choice is either the XML Configuration File or the SQL
Server option. Both of these options allow you to select and store multiple property values
in the external configuration. We like the flexibility and ease of use of XML files. Using a SQL
Server table is a good solution when you want to create a single place where you’ll store all
your packages’ common configurations and when you want to be sure to protect them using
SQL Server’s security infrastructure.

After you’ve chosen the configuration type you want to use, you can select the property val-
ues that you want to be saved into the package configuration. For example, if you need to
have the value of a variable called SystemRoot stored in a package configuration, all you have
to do is choose the Value property of that variable on the Select Target Property page of the
wizard, as shown in Figure 18-11.
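
To give you an idea of what the wizard produces, here is a minimal sketch of the resulting XML
configuration (.dtsConfig) file for that SystemRoot example. The configured value is a placeholder,
and files generated by the wizard typically also include a heading element with authoring metadata:

<?xml version="1.0"?>
<DTSConfiguration>
  <Configuration ConfiguredType="Property"
                 Path="\Package.Variables[User::SystemRoot].Properties[Value]"
                 ValueType="String">
    <ConfiguredValue>C:\Windows</ConfiguredValue>
  </Configuration>
</DTSConfiguration>

Because the file is plain XML, an administrator can edit the ConfiguredValue element on the target
server without ever opening the package in BIDS.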

After you’ve enabled and defined the package configuration, the package will attempt to
load that configuration each time it’s executed. It will also attempt to load the configuration
when the package is opened in the BIDS environment. If at runtime the package configura-
tions cannot be found, the package generates a warning and continues the execution using
the values available directly within the package. If more than one configuration is specified,
the package uses all of them, using the last value available if two or more package configu-
rations try to set the value for the same property. Because of this potential for overwriting
information, it’s a best practice to configure variable values only once in external package
configurations. Also, you should name your configuration files to reflect which values they are
updating—for example, File Path To MainFrame A Dump, Connection String For Database B, and
so on.

For example, suppose you have a variable A that is configured to load its value from a
package configuration, you specify three package configurations to be used, and all of them
contain a value for variable A. In that case, the value that is used is the one from the last
package configuration in the list.

Figure 18-11 Selection of property values to be exported to a package configuration

Here’s an example. Suppose that these are the package configurations set in the Package
Configurations Organizer dialog box:

■■ Package Configuration One: Tries to set A to 100


■■ Package Configuration Two: Tries to set A to 200
■■ Package Configuration Three: Tries to set A to 300

At runtime, the value for A will be 300.

This can seem confusing, but it can be very useful when you need to make an exceptional
(such as a one-time-only) execution of your package using some specific configuration val-
ues that should be used only for that execution. Thanks to the way in which package con-
figurations are loaded and used, you can simply create a special package configuration for
that particular execution and put it as the last one in the list, and that’s it. Without having
to modify the usual package configuration, you can handle those particular exceptions. Of
course, this type of complexity should be used sparingly because it can lead to confusion and
undesirable execution results.

As we’ve discussed, package configurations are referenced directly from the package and
are enabled using the previously mentioned Package Configurations Organizer dialog box.
Because your package might need to reference different package configurations when
deployed on production or preproduction servers, you might also need to specify the pack-
age configuration it has to use directly when invoking its execution. You can do that with
DTExec, DTExecUI, or the specific Integration Services Job Step.

Note It’s important that you understand that the package stores a reference to the configura-
tion internally, in the form of a hard-coded location (a path for XML files, for example). Also, with
SQL Server 2008, the configuration behavior has changed. So you can override a SQL Server–
based configuration connection string using the /CONN switch. For more information, see the
“Understanding How Integration Services Applies Configurations” section in the following white
paper: http://msdn.microsoft.com/en-us/library/cc671625.aspx.
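
As an illustration (the file and connection manager names are placeholders), the following command
lines show a package being run with an external XML configuration and, separately, with a connection
string overridden through the /CONNECTION switch:

rem Run a package and point it at a server-specific XML configuration
dtexec /FILE "C:\SSISPackages\ProcessXMLData.dtsx" /CONFIGFILE "C:\SSISConfig\Production.dtsConfig"

rem Override the connection string of a connection manager named SourceDB
dtexec /FILE "C:\SSISPackages\ProcessXMLData.dtsx" /CONNECTION SourceDB;"Data Source=MYSERVER;Initial Catalog=Staging;Integrated Security=SSPI;"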

Copy File Deployment


This is by far the simplest method. You can simply take the .dtsx file you have in your solu-
tion directory and copy it to your target folder on the destination server. In this case, you
typically need to have a file share on the server where you can copy packages, as shown in
Figure 18-12.

Figure 18-12 File copy deployment

If the directory on the server is not the directory managed by the SSIS Package Store, the
packages stored here won’t be visible from SSMS. If you want to have packages visible and
manageable from there, you have to copy your .dtsx files to the SSIS Package Store directory,
which is %Program Files%\Microsoft SQL Server\100\DTS\Packages by default.

In that folder, you can also organize packages using subfolders. For example, you might want
to deploy your package Sample Package 1.dtsx so that it will be stored in a folder named
SQL2008BI_SSIS_Deployment. Because this folder is used by the Integration Services service,
you can see its content right from SQL Server Management Studio as shown in Figure 18-13.

Figure 18-13 SSMS Package Store folder browsing

And from there, you can manage the deployed package. Management tasks include schedul-
ing package execution, logging package execution, and assigning variable values at execu-
tion runtime. Also, management includes backing up the storage location or locations, such
as the file system or msdb database where the packages are stored.

BIDS Deployment
Packages can also be deployed directly from BIDS. This is a very simple way to deploy pack-
ages because you don’t have to leave your preferred development environment. To accom-
plish this type of deployment, all you have to do is select the Save Copy Of <xxx file> As item
from the BIDS main File menu. To have this menu item available, you have to verify that you
have the appropriate package selected in the Solution Explorer window. One quick way to
verify this selection is to look at the Properties window, shown in Figure 18-14. You should
see the name of the package that you want to save as shown in this window.

Figure 18-14 Package Properties window



Otherwise, you might not be able to see the Save Copy Of…As menu item. After selecting the
previously mentioned menu item, the Save Copy Of Package dialog box appears, as shown in
Figure 18-15.

Figure 18-15 Save Copy Of Package dialog box

In this dialog box, you can decide where the package will be deployed through the Package
Location property. The list box allows you to choose from the three locations we discussed
earlier: SQL Server (MSDB), SSIS Package Store (which allows you to deploy to either a man-
aged path or msdb), or a defined file system path.

For the SQL Server and SSIS Package Store options, you also provide the target server and
the authentication information needed to connect to that server. For the SSIS Package Store
option, only Windows Authentication is available, while for SQL Server you can alternatively
use SQL Server Authentication.

In the Package Path text box, enter the path and the name that the package will have once
it’s deployed on the new server. The Protection Level option allows you to decide whether
to encrypt your package after it has been deployed and how to encrypt it. For the package’s
protection level, you can choose from six options. These options are explained in “Setting the
Protection Level of Packages” in Microsoft SQL Server Books Online as follows:

■■ Do Not Save Sensitive This protection level does not encrypt; instead, it prevents
properties that are marked as sensitive from being saved with the package and there-
fore makes the sensitive data unavailable. If a user opens the package, the sensitive
information is replaced with blanks and the user must provide the sensitive information.
■■ Encrypt All With Password Uses a password to encrypt the whole package. To open
or run the package, the user must provide the package password. Without the pass-
word, no one can access or run the package.
■■ Encrypt All With User Key Uses a key that is based on the current user profile to
encrypt the whole package. Only the same user using the same profile can load, mod-
ify, or run the package.
■■ Encrypt Sensitive With Password Uses a password to encrypt only the values of
sensitive properties in the package. DPAPI is used for this encryption. DPAPI stands for
data protection API and is a standard in the industry. Sensitive data is saved as a part of
the package, but that data is encrypted by using the specified password. To access the
package, the password must be provided. If the password is not provided, the pack-
age opens without the sensitive data so that new values for sensitive data have to be
provided. If you try to execute the package without providing the password, package
execution fails.
■■ Encrypt Sensitive With User Key Uses a key that is based on the current user profile
to encrypt, using DPAPI, only the values of sensitive properties in the package. Only the
same user using the same profile can load the package. If a different user opens the
package, the sensitive information is replaced with blanks and the user must provide
new values for the sensitive data. If the user attempts to execute the package, package
execution fails.
■■ Rely On Server Storage For Encryption Protects the whole package using SQL Server
database roles. This option is supported only when a package is saved to the SQL
Server msdb database. It’s not supported when a package is saved to the file system.

As you can see, you have several choices to protect your sensitive package data, so it’s impor-
tant for you to understand what type of data can be considered sensitive. Certain property
values are obviously sensitive, such as passwords, while other values do not so obviously
need encryption or security. It’s important that you document in your business requirements
which information is considered security-sensitive and then create your package configura-
tions accordingly. You should be aware that the default setting is Encrypt Sensitive With User
Key. Of course, like any type of security that uses key-based encryption, this type of security
depends on appropriate key storage mechanisms. You should partner with your network
administrators to ensure that appropriate—that is, nondefault—mechanisms are in place for
such key storage if you choose to use this type of encryption for your SSIS packages.

We advocate that clients use “just enough” security in SSIS package security design (and
we implement that ourselves). That is, we frequently stick with the default security setting
(Encrypt Sensitive With User Key) because of its small overhead and ease of use. As men-
tioned, the only complexity in using this setting is appropriate key storage.

We’ve used password-based encryption in scenarios where no key management was in place.
We prefer to refrain from password-based encryption for a couple of reasons. The first is
password management—that is, storage, complexity, and recovery. The second is that,
because passwords involve people, this mechanism is inherently less secure than a
key-based system.

Of course, we have encountered scenarios where the entire package needed to be encrypted.
This amount of encryption adds significant overhead to package execution and should be
used only in business scenarios that specifically call for it.

Deployment with the Deployment Utility


The last option you have for deploying a package to a target server is the use of the BIDS
SSIS Deployment Utility. To enable this feature, you have to set a specific project property.
You do this by right-clicking on the project in the BIDS Solution Explorer window and then
clicking Properties on the shortcut menu.

After selecting Properties, the project property window appears, as shown in Figure 18-16.
In the Deployment Utility section, you have to set the CreateDeploymentUtility property to
True. From here on, every time you build the project by selecting Build <project name> from
the Build menu in BIDS, all the packages present in the project will be copied into the direc-
tory defined in DeploymentOutputPath, where a special XML-based file with the extension
SSISDeploymentManifest will be created.

Figure 18-16 The project's property window, where you can enable the Deployment Utility

Now, to deploy your packages to the server you want, all you have to do is double-click on
that manifest file. After you do that, the Package Installation Wizard opens. By going through
the pages of the wizard, you can decide where to deploy packages. In this case, you’re limited
to only two locations: SQL Server or the file system, as shown in Figure 18-17.

Figure 18-17 Package Installation Wizard location selection window

When you click Next, you see the Select Installation Folder page if you chose File System
Deployment; if you chose SQL Server Deployment, you see the Specify Target SQL Server
page. Then, when you click Next on either of those pages, the Confirm Installation page
appears, which simply states it has enough information to start and tells you to click Next to
start the installation. Finally, you see the Finish The Package Installation Wizard page, where
you can view what’s been done and click Finish.

After all package and package configuration XML files have been deployed, you have the
chance to change the values saved in the XML configurations used by your packages. Keep in
mind that you can do that only if they’re using an XML configuration file. On the Configure
Packages page of the Package Installation Wizard, shown in Figure 18-18, there are two
configurable properties: one for a file name and path, and the other for a file name. Both
properties are configured dynamically via this XML file, and their values are retrieved at SSIS
package runtime.

Figure 18-18 The Configure Packages page of the Package Installation Wizard

It’s a best practice to use package configurations in general and XML files in particular, as
they are the most flexible for both deployment and subsequent customization. We
remind you that if sensitive information is stored in these XML files, you must consider how
you will secure this information. Alternatively, as mentioned, you can choose to store con-
figuration values in SQL Server and thereby take advantage of SQL Server security. Finally,
if you don’t want to allow configuration changes during deployment, you can disable it by
specifying the value False for AllowConfigurationChanges in the project’s Deployment Utility
properties.

Note In addition to using the SSIS Deployment Utility to deploy SSIS packages, you can also use
SSMS or the command-line utility dtutil.exe to accomplish package deployments.

SQL Server Agent and Integration Services


After your SSIS packages have been deployed to the target server or servers, you need to
schedule their execution. To do that, you’ll generally use SQL Server Agent by creating a spe-
cific job for the execution of each (normally updated) package.

To create a new job, expand the SQL Server Agent node in SSMS Object Explorer so that
you can see the Jobs item. Right-click Jobs and select the New Job option from the shortcut
menu, and then provide a name for the job.

A job is made up of smaller elements called steps, each of which performs a specific action. To create
a new step, select the Steps page on the left side of the dialog box, and click the New but-
ton at the bottom of the dialog box. A name for the job step needs to be provided, and you
need to select the type of step to perform. In this case, you use the job step type SQL Server
Integration Services Package to run the desired package, as shown in the New Job Step dia-
log box in Figure 18-19.

Figure 18-19 New Job Step dialog box

Here you’ll work with an interface similar to the one offered by DTExecUI where you can
choose the package you want to run, specifying all the options you might want to configure.
After you select the package and configure its options, you then decide on the schedule and
the notification options just like any other job you define with the SQL Server Agent.
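
The same job can also be created with T-SQL rather than through the dialog boxes. The following
sketch shows the general pattern; the job name, package path, and configuration file are
placeholders, and the proxy account referenced in the step is discussed later in this chapter:

EXEC msdb.dbo.sp_add_job @job_name = N'Nightly ETL Load';

EXEC msdb.dbo.sp_add_jobstep
    @job_name = N'Nightly ETL Load',
    @step_name = N'Run ProcessXMLData',
    @subsystem = N'SSIS',
    @command = N'/FILE "C:\SSISPackages\ProcessXMLData.dtsx" /CONFIGFILE "C:\SSISConfig\Production.dtsConfig"',
    @proxy_name = N'SSIS Package Proxy';

EXEC msdb.dbo.sp_add_jobserver @job_name = N'Nightly ETL Load';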

If you prefer to invoke package execution directly using DTExec or another tool—for exam-
ple, DTLoggedExec—you can choose the job step type Operating System (CmdExec). With
this job step, you can call any application you want to use, just as if you were invoking it from
the command line.
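
In that case, the job step's command text is simply the command line itself; for a package deployed
to the SSIS Package Store, a sketch (with placeholder folder and package names) might look like this:

dtexec /DTS "\File System\SQL2008BI_SSIS_Deployment\ProcessXMLData" /SERVER MYSERVER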

Introduction to SSIS Package Security


When a package runs, it might need to interact with the operating system to access a direc-
tory where it has to read or write some files, or it might need to access a computer running
SQL Server to read or write some data. In any of these cases, the package needs to be recog-
nized by the target entity—whether it is the operating system or another kind of server, from
databases to mail servers—so that it can be authenticated and authorized to perform the
requested activity. In other words, the package needs to have some credentials to present to
a target system so that security can be checked.

Understanding under which credentials a package will run is vital to avoid execution failures
with packages deployed on servers and scheduled to be executed by an automation agent
such as SQL Server Agent. This is a common source of questions and consternation among
the DBA community, so we recommend you read the next section particularly closely. We
almost never see execution security set up correctly in production. In the most common situ-
ation, the DBA attempts to run the package using (appropriate) restricted security and the
package fails. Rather than troubleshooting, what we often see is a reversion to no security—
that is, running the package with full administrative permissions. Don’t replicate this bad
practice in your production environment!

When a package is executed interactively—such as when you run a package from BIDS, from
DTExecUI, or by invoking it from the command line using DTExec—it's more obvious under
which credentials the package is run. That is, it’s more obvious which user has executed the
package. In this case, each time the package executes and accesses a resource that uses
Windows Authentication, it presents to the target system the credentials of the user who ran
the package. So, for example, if in your package you have a connection to SQL Server that
makes use of Windows Authentication or you need to load a file from a network share and
you’re executing the package using DTExecUI, your Windows user account needs to have
appropriate permission to access SQL Server and to read from or write to tables used in the
package, and the same account will also be used to verify the credentials to access any files
stored locally or in network shares.

Understanding the security execution context becomes a little more complex when you
schedule package execution with SQL Server Agent. In this situation, you must under-
stand the key role that is played by the user who owns the job, which typically is the user
who creates the job. Be aware that job ownership can be modified by a SQL Server system
administrator.

If the user who owns the job is a sysadmin, the job step SQL Server Integration Services
Package will be run under the credentials of the account used by the SQL Server Agent
Service. As we mentioned earlier, sysadmin is the most highly privileged role, and you
should refrain from using a sysadmin account as a security credential for production ETL SSIS pack-
ages. Using sysadmin introduces an unacceptable level of security risk into production
environments.

The preferred way to appropriately execute SSIS packages using SQL Server Agent is to use
a proxy account. By default, the SQL Server Agent job step is initially configured to run as
the SQL Server Agent Service Account in the Run As combo box of the Job Step window as
shown in Figure 18-20.

Figure 18-20 The Run As combo box in the Job Step window, configured to use the SQL Server Agent Service Account

To create a proxy account, you first need to ask your system administrator to create a domain
user that will be used by SSIS to run your package. Suppose that the Windows domain user is
called MYDOMAIN\SSISAccount. After the domain account is created, you associate this new
account’s login name and password inside SQL Server so that these values can be used later
to impersonate the account and execute SSIS packages under its credentials.

To do that, you first have to open the Security item in Object Explorer in SSMS. Right-click
Security and select New Credential on the shortcut menu. The New Credential dialog box
opens, and here you specify a name for the credential and the login name and password for
the MYDOMAIN\SSISAccount user as shown in Figure 18-21.

Figure 18-21 The New Credential dialog box

This SQL Server object, called a credential, simply holds—in a secure way—the specified
login name and password. The credential object needs a name, just like any other object
in SQL Server, and here we use the name SSIS Account Credentials. After creating the cre-
dential object, you can define the proxy account. Basically, a proxy account is an object that
defines which SQL Server user can use the chosen credential object and for which purpose.
The proxy account can be found and defined under the SQL Server Agent folder in Object
Explorer, in the Proxies folder. Each proxy account type will be displayed in the folders for
types associated with it—that is, ActiveX Script, Operating System (CmdExec), SQL Server
Integration Services Package, and so on. You can see this by taking a look at the SQL Server
Agent/Proxies folder in SSMS.

As you can see, you can define proxy accounts for any of the SQL Server Agent job steps that
need access to resources outside SQL Server boundaries and thus need to authenticate each
request. Right-clicking on the Proxies item allows you to select the Unassigned Proxy com-
mand, which you can use to create a new proxy assignment. The New Proxy Account dialog
box, where you can create your proxy account, appears, as shown in Figure 18-22.

Figure 18-22 New Proxy Account dialog box

Here you can set the name of your proxy account, the credentials that this proxy account will
use when it needs to be authenticated outside SQL Server by the Windows operating system,
and the subsystems that are authorized to use this proxy account. If you’re the SQL Server
DBA, on the Principals page you can also specify which SQL Server users are authorized to
use this proxy account. After creating the proxy account, you can use it when setting up
the SQL Server Agent job step as shown in Figure 18-23. You can specify the proxy account
instead of SQL Server Agent Service Account in the Run As combo box.

Figure 18-23 The Run As combo box in the Job Step window, now configured to use the created proxy account

Each time SQL Server Agent executes the package using this step, the package will run under
the credentials of the MYDOMAIN\SSISAccount account. Therefore, you have to configure all
the resources used by this package (SQL Server, file system, mail servers, and so on) so that
they will authenticate and authorize MYDOMAIN\SSISAccount. When defining permission for
that account, always keep security in mind and give that account only the minimum permis-
sion it needs to perform the work defined in the package.
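
The same setup can be scripted. The following T-SQL sketch mirrors the dialog boxes just described;
the password, proxy name, and authorized login are placeholders, and you may want to confirm the
SSIS subsystem name in msdb.dbo.syssubsystems on your server:

-- Store the Windows account and password as a SQL Server credential
CREATE CREDENTIAL [SSIS Account Credentials]
    WITH IDENTITY = N'MYDOMAIN\SSISAccount', SECRET = N'<strong password here>';

-- Create the proxy account that uses the credential
EXEC msdb.dbo.sp_add_proxy
    @proxy_name = N'SSIS Package Proxy',
    @credential_name = N'SSIS Account Credentials',
    @enabled = 1;

-- Authorize the proxy for the SQL Server Integration Services Package subsystem
EXEC msdb.dbo.sp_grant_proxy_to_subsystem
    @proxy_name = N'SSIS Package Proxy',
    @subsystem_name = N'SSIS';

-- Allow a specific login to use the proxy in its job steps
EXEC msdb.dbo.sp_grant_login_to_proxy
    @proxy_name = N'SSIS Package Proxy',
    @login_name = N'MYDOMAIN\ETLOperator';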

Handling Sensitive Data and Proxy Execution Accounts


As we discussed earlier in this chapter, each time you save a package with sensitive data to
a file, that data needs to be removed or protected. If you save a package with the option
Encrypt Sensitive With User Key, only the user who saved the package can decrypt the sen-
sitive data. This often causes decryption issues when scheduling the package execution,
because the executing user account usually differs from the user account who created and
saved (and encrypted) it.

This means that as soon as the executing account tries to open the package, it won’t be able
to decrypt sensitive data, and any connection to resources that require login or password
information in order to authenticate the request will fail. You might be thinking that you can
create a proxy account that allows the package to use the same account as the person who
saved it, but this isn’t a recommended practice because you might have more than one per-
son (each with his own account information) who needs to execute the package on a regular
basis.

To solve this problem, you can choose to save the package using the Encrypt Sensitive With
Password option so that all you have to do is provide to the SQL Server Agent Integration
Services job step the password to decrypt sensitive data. Unfortunately, although this solu-
tion encrypts the package, it’s rather ineffective because it provides only weak security.
Passwords are often compromised through inappropriate sharing, and also they’re often
forgotten.

For these reasons, we avoid package deployment to the file system and prefer package
deployment on SQL Server MSDB. Here packages are not encrypted, and security is enforced
and guaranteed by SQL Server. Another option we sometimes use is the DontSaveSensitive
option, which stores the sensitive information in a secured configuration location. Inside the
msdb, there are three database roles that allow you to decide who can and cannot use pack-
ages stored in that way:

■■ db_ssisadmin Can do everything (execute, delete, export, import, change ownership)
on packages
■■ db_ssisltduser Can manage their own packages
■■ db_ssisoperator Can execute and export all packages

These are the default roles; to give a user the permissions of a specific role, simply add the
user to that role. As usual, all the SQL Server accounts that are part of
the sysadmin server role can also administer SSIS without restrictions because they’re implic-
itly part of the db_ssisadmin role. Of course, you can also choose to create your own roles
with specific permissions. If you want to do this, you should refer to the SQL Server Books
Online topic, “Using Integration Services Roles.”

Security: The Two Rules


As you can see, security makes things a little more complicated at the beginning. However,
because SSIS packages can interact with any type of resource (from local files to databases
to network-accessible or even Internet-accessible resources and systems through Web
services or FTP), security must be appropriately constrained so that no
harm—voluntary or involuntary—can be done to the system.

Here are two best practices:

■■ If possible, always use Windows Authentication so that you don't have to deal with sen-
sitive data. In this case, you can simply deploy packages to the file system or the SSIS
Store and just configure the proxy account to use.
■■ If you need to store sensitive data in your package, use the msdb database as a storage
location so that you can control who can access and use that package simply by using
the SQL Server security infrastructure, thereby avoiding having package passwords
passed around in your company.

The SSIS Service


The SQL Server Integration Services service is a Windows service that monitors and catalogs
packages. Although you can manage this service—that is, configure its user account, start or
stop it, and so on—from the Windows Control Panel, you'll probably use SSMS to work
with the SSIS service because of the greater functionality exposed through that interface. You
can connect to that service from within SSMS by using the Object Explorer window and click-
ing Integration Services. Only Windows Authentication is supported for connecting to the
SSIS service.

After you’ve successfully connected to the SSIS service, you can view all the packages saved
in the SSIS Package Store and in the msdb database in the Object Explorer window of SSMS.
This is shown in Figure 18-24. You can also view all the currently running (executing) packages
on that instance, regardless of how they were executed—that is, from BIDS, from DTExecUI,
and so on.

Note To associate more than one physical folder with the SSIS Package Store, you need to edit
the %Program Files%\Microsoft SQL Server\100\DTS\Binn\MsDtsSrvr.ini.xml file to add other file
locations.
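
As a rough sketch (the extra folder name and path below are examples you would replace), the
TopLevelFolders section of that file looks something like this, with one Folder element per location
the service should expose:

<?xml version="1.0" encoding="utf-8"?>
<DtsServiceConfiguration xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <StopExecutingPackagesOnShutdown>true</StopExecutingPackagesOnShutdown>
  <TopLevelFolders>
    <Folder xsi:type="SqlServerFolder">
      <Name>MSDB</Name>
      <ServerName>.</ServerName>
    </Folder>
    <Folder xsi:type="FileSystemFolder">
      <Name>File System</Name>
      <StorePath>..\Packages</StorePath>
    </Folder>
    <Folder xsi:type="FileSystemFolder">
      <Name>ETL Archive</Name>
      <StorePath>D:\SSIS\ArchivedPackages</StorePath>
    </Folder>
  </TopLevelFolders>
</DtsServiceConfiguration>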

Figure 18-24 The SSMS menu items for the SSIS service

For each running package, you can halt package execution in this window. Also, for all the
stored packages, you can perform administrative tasks, such as creating new folders, import-
ing and exporting packages, executing packages, or deleting packages. Your account must
have appropriate permission to perform these types of tasks—that is, stopping, configuring,
and so on—to successfully complete them in the SSMS interface.

Tip We already mentioned the usefulness of BIDS Helper, a free BIDS add-in, in the context of SSAS
cubes. BIDS Helper is available from CodePlex at http://www.codeplex.com/bidshelper. The tool includes a
number of utilities that make working with SSIS packages easier. These tools include the follow-
ing: Create Fixed Width Columns, Deploy SSIS Packages, dtsConfig File Formatter, Expression
And Configuration Highlighter, Expression List, Fix Relative Paths, Non-Default Properties Report,
Pipeline Component Performance Breakdown, Reset GUIDs, Smart Diff, Sort Project Files, SSIS
Performance Visualization, and Variables Window Extensions.

Summary
In this chapter, we covered the key aspects of SSIS package versioning and deployment. We
then took a closer look at security options that are involved in package deployment. We
advise you to take to heart our recommendations for using appropriate security for execut-
ing packages. You can easily expose unacceptable security vulnerabilities if you do not follow
best practices.

You’ve also learned how to monitor and manage packages through the SSIS service and how
to deal with security in this area.
Chapter 19
Extending and Integrating SQL Server 2008 Integration Services
As we discovered in previous chapters, SQL Server Integration Services (SSIS) gives us a lot of
power and flexibility right out of the box. With dozens of tasks and components ready to use,
a lot of the typical work in an extract, transform, and load (ETL) phase can be done by just
dragging and dropping objects, and configuring them appropriately. But sometimes these
included tasks are just not enough, particularly because business requirements or data trans-
formation logic can get complex.

SSIS provides an answer even to these demanding requirements we find in business intelli-
gence (BI) projects. This answer lies in using very powerful scripting support that allows us to
write our own custom logic using .NET code. In this chapter, we show you how to use script-
ing and create custom objects, as well as how to invoke the power of SSIS from your custom
applications.

Introduction to SSIS Scripting


Scripting in SQL Server 2008 Integration Services is available through two different objects in
the Toolbox. The Script Task enables scripting usage in the control flow, and for scripting in
the data flow you use the Script Component. As you’ll see later, transforming data is just one
of the many things you can do with the Script Component.

The two objects have a lot in common but are not identical. Both are built on the .NET
Framework, so the core scripting capabilities are the same: for both, you use the same
integrated development environment—Microsoft Visual Studio Tools for Applications—and
you can choose between C# and Visual Basic .NET as the scripting language. The differences
relate to the different purposes of the two objects. Note that the ability to write scripts in
C# is new to SQL Server 2008; in previous versions, you were limited to writing in Visual
Basic .NET only.

The Script task lives in the control flow, and its script can accomplish almost any generic func-
tion that you might need to implement. If the Script task is not explicitly put into a Loop con-
tainer, it will run only once and will not have any particular restriction in the way it can access
package variables.

The Script component lives in the data flow, so its script code—not all of it, but at least the
main processing routine—will be executed as many times as needed to process all the data
that comes through the component. The script here works directly with data that flows from
the sources to the destinations, so you also need to deal with columns and flow metadata.
You can decide whether to use the Script component as a source, destination, or transforma-
tion. Depending on your decision, you need to implement different methods in the script so
that you can consume or produce data, or both.

In addition to these generic differences, there are also other differences related to the way in
which the scripts can interact with the package. As you’ll see in more detail in the upcoming
pages, there are differences in how to deal with package variables, logging, debugging, and
so on.

Because the Script component is more complex than the Script task, we’ll begin by taking
a look at how the Script task works and what types of business problems you can use it to
address, and then we’ll turn our attention to the Script component. Because some concepts
are shared between the two objects but are simply implemented in a different way, drilling
down into the Script component after taking a closer look at the Script task makes sense.

Visual Studio Tools for Applications


When implementing SSIS scripting, you will use a new interface in SQL Server 2008. SQL
Server 2008 includes Visual Studio Tools for Applications (VSTA) rather than Visual Studio for
Applications (VSA), which was the default scripting environment used in SQL Server 2005.

This is a big improvement because VSTA is a full-featured Visual Studio shell. So you finally
have full access to all Visual Studio features; you’re no longer limited to referencing only a
subset of assemblies—you can use and reference any .NET assembly you might need. You
can also reference Web Services simply by using the standard Add Web Reference feature.

The Script Task


The Script task allows you to add manually written business logic into the control flow. Let’s
take a closer look now at exactly how this task works.

To get started, create a new SSIS package in Business Intelligence Development Studio (BIDS).
Drag the Script Task from the Toolbox onto the designer surface. After you have dropped the
appropriate object on the control flow surface, double-click on the newly created Script Task
box to open the Script Task Editor dialog box. Note that the default scripting language is C#.
Alternatively, you can write your SSIS scripts in Visual Basic .NET. The language type is config-
ured at the level of each Script task using the ScriptLanguage property. Once you have begun
editing the script, you will not be able to change the language, so make sure you pick the
appropriate one before clicking the Edit Script button. Note also that the default entry point
(or method) is named Main. This is shown in Figure 19-1.

Figure 19-1 Script Task Editor

In addition to choosing the script language and the name of the entry method, you can also
use this dialog box to define variables that your script will use. You can define the usage of
variables associated with your Script Task as either ReadOnly or as ReadWrite.

The SSIS runtime will try to scale as much as possible to make your package run as fast as
possible, which means that anything that can be run in parallel is likely to be executed in
parallel with other tasks. The SSIS runtime needs to know which tasks can run in parallel with-
out interfering with each other. If two tasks both need to access variable values in read/write
mode, it’s better not to have them work on the same variable in parallel; otherwise, you’ll
surely have some unpredictable results, because there’s no guarantee of the order in which
the tasks will access the variable.

Note You should be aware that the SSIS runtime doesn’t prevent tasks from running in parallel,
even if they both access the same variable. Also, the locking method functions in a way that you
might not expect. For more information, see the following post: https://forums.microsoft.com/
forums/showpost.aspx?postid=3906549&siteid=1&sb=0&d=1&at=7&ft=11&tf=0&pageid=0). One
example of this unexpected behavior is that locking a variable for read doesn’t stop you from
writing to it.

Note that here you don’t have to prefix the package variable name that you want to be
able to access from your script’s code with the @ character, because you’re not specify-
ing an expression, just package variable names. We remind you that SSIS variable names
are always case sensitive. Specifically, this is important to remember if you intend to write
your script in Visual Basic .NET. Figure 19-2 shows an example of some configured script
properties. We’ve defined a couple of variables in the Script Task Editor dialog box. These
are the ReadOnlyVariables properties named User::FileName and User::Path, and the
ReadWriteVariables property named User::FileExists. We’ll refer to these variables in the script
that we write to check for the existence of a file later in this section.

Figure 19-2 Script variable properties

Note You can also lock variables directly within the script, using the Dts.VariableDispenser
object. The advantage of using this over the ReadOnlyVariables or ReadWriteVariables properties
is that the variables are locked for a shorter period, and you have explicit control over when the
lock starts and stops. Here is an example of this code:
Variables vars = null;
Dts.VariableDispenser.LockOneForWrite("FileLocked", ref vars);
vars["FileLocked"].Value = true;
vars.Unlock();

Having defined everything you need, you can now start to write the script. To do this, you
need to use Visual Studio Tools for Applications, which will run after you click the Edit Script
button in the Script Task Editor dialog box. After VSTA is loaded, you’ll see that some auto-
generated code is already present. Take a look at the autogenerated Main method shell; this is
where you'll write most of the code you want to be executed.

As you author the script, remember that you can use any feature that your selected .NET
language supports. Because you’re using a Visual Studio shell, you can reference other
assemblies simply by using the standard Visual Studio item Add Reference. In the same
way, you can also add references to external Web Services by using the Add Web Reference
menu item.

Note The terms assembly and namespace are frequently used in .NET terminology. Assemblies
are basically DLL or EXE files that provide some functionality. Those functionalities are provided
by classes that make them available through methods and properties. Classes are organized into
namespaces that are contained in and across assemblies.

The Dts Object


To access any package variable, you have to use the Dts object that the SSIS engine exposes,
and use its Variables collection property to access the particular variable object that you want
to work with. After you have access to the package variable, you can use its Value property
to get or set the variable value. Because this is a common source of bugs, we remind you
once again that all variable names are case sensitive—regardless of whether you’re using
C# or Visual Basic .NET as your scripting language. Note that we included a reference to the
System.IO namespace in our sample so that we can use objects and methods that work with
the file system in our script task. In the following example, we’ve used a simple assignment to
set the value of the User::FileExists variable:

public void Main()
{
string folderPath = (string)Dts.Variables["User::Path"].Value;
string fileName = (string)Dts.Variables["User::FileName"].Value;

string filePath = System.IO.Path.Combine(folderPath, fileName);

Dts.Variables["User::FileExists"].Value = System.IO.File.Exists(filePath);

Dts.TaskResult = (int)ScriptResults.Success;
}

In addition to using the Variables collection, the Dts object exposes several interesting prop-
erties that allow you to programmatically interact with the package. One of the most com-
monly used properties is TaskResult. This property notifies the SSIS runtime whether the
execution of your script should be considered a success or a failure so that the workflow
defined in the control flow can follow the related branch.

The other properties exposed by the Dts object are the following:

■■ Log Allows you to write logging information that will be used by the SSIS logging
infrastructure.
■■ Connections Gives you access to the connection managers present in the SSIS pack-
age, allowing you to connect to the related data source.
■■ Events Allows you to fire events—for example, FireInformation and FireProgress.
■■ Transaction Permits the script to join or manage a transaction by indicating the status
of the transaction through the task execution result. For example, the success of the
task can equate to the success of the entire transaction, or the success of the task can
cause a series of tasks in the transaction to continue.
■■ ExecutionValue In addition to indicating a simple Success or Failure result, you might
need to communicate an additional value to the control flow.
For example, you might need to check whether a file exists in a specific directory. If it
does exist, different branches must be followed in your control flow, depending on the
name of that file. You can assign the name of the file to the Dts object's ExecutionValue
property so that it will be available in the control flow. To access it in the control flow, you
must point the ExecValueVariable Script task property to an existing package variable.
This variable will hold the value put into the ExecutionValue property from the script
code. In that way, you can easily use that value in an expression or in another task for
further processing (see the sketch that follows this list).
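
The following C# sketch pulls several of these properties together; the folder variable and message
text are illustrative only, and the surrounding class and ScriptResults enumeration are the ones
generated for you by the Script task:

public void Main()
{
    string folderPath = (string)Dts.Variables["User::Path"].Value;

    // Write a custom entry through the SSIS logging infrastructure
    Dts.Log("Scanning folder " + folderPath, 0, new byte[0]);

    // Fire an informational event that event handlers or log providers can capture
    bool fireAgain = true;
    Dts.Events.FireInformation(0, "Script Task", "Looking for input files", string.Empty, 0, ref fireAgain);

    // Pass an additional value (the first file found, if any) back to the control flow
    string[] files = System.IO.Directory.GetFiles(folderPath);
    Dts.ExecutionValue = (files.Length > 0) ? System.IO.Path.GetFileName(files[0]) : string.Empty;

    Dts.TaskResult = (int)ScriptResults.Success;
}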

After you’ve finished modifying the script to suit your business requirements, you can simply
close Visual Studio Tools for Applications, and the script will be automatically saved.

Debugging Script Tasks


A Script task can be easily debugged. As we saw in Chapter 15, “Creating Microsoft SQL
Server 2008 Integration Services Packages with Business Intelligence Development Studio,”
and Chapter 16, “Advanced Features in Microsoft SQL Server 2008 Integration Services,” SSIS
offers many options for debugging in the BIDS environment. You’ll be happy to know that
these extend to the Script task as well. SSIS script debugging works as it does in any other
standard .NET application.

With VSTA open, you need to place the cursor in the line of code upon which you want
to have the execution halted, open the shortcut menu with a right-click, and click Insert
Breakpoint.

After you set the breakpoint, you can run the package in debug mode in BIDS. The Script
task will have a red circle on it, indicating that the script has a breakpoint set. The execution
will stop and enter into interactive debugging mode as soon as it reaches the breakpoint. All
of the standard debugging windows—that is, locals, autos, and so on—are available for SSIS
script debugging.

The Script Component


Now that we’ve taken a look at how you can use the Script task in the control flow, we’ll turn
next to using the Script component in a data flow. Here you’ll see how you can leverage the
power of scripting while processing data.

To take a closer look at the Script component, you need to drag one Data Flow task onto
your sample package’s designer surface from the Toolbox. Double-click the Data Flow task to
open the data flow in design view. Next, select the Script Component from the Toolbox and
drop it onto the data flow designer surface.

As soon as you drop the object, the Select Script Component Type dialog box (shown in
Figure 19-3) is displayed, which enables you to specify how you intend to use this Script
component in the data flow. In the Select Script Component Type dialog box, select Source,
Destination, or Transformation (the default value). For this example, keep the
Transformation setting.

Figure 19-3 Script component type options

The Script component can behave in three different ways, depending on how you elect to
use it. It can be used as a data source so that you can generate data that will be used in
the data flow. This is useful, for example, when you need to read data from a custom-made
source file that doesn’t conform to any standard format. With a source Script component,
you can define the columns that your scripted source will provide, open the file, and read
data, making it flow into the data flow. Of course, before you take the time and effort to
manually write such a script, you should exhaust all possibilities for locating and down-
loading any additional data source components that Microsoft or the SSIS community
might have already developed and made available. We recommend checking CodePlex
(http://www.CodePlex.com) in particular for this type of component.

A destination Script component is similar to a source, but it works in the opposite way. It
allows you to take the data that comes from the data flow and store it in a custom-made
destination. Again, this is typically a destination file with custom formatting.

The data flow transformation Script component is able to take input data from another data
flow component, transform that data via custom logic, and provide the transformed data as
an output for use by other data flow components. We most often use the Script component
in SSIS package data flows to execute custom transformation logic, so we’ll focus the remain-
der of our discussion on this particular type of implementation.

After selecting the Transformation option and clicking OK, you’ll see an object named Script
Component on the data flow designer surface, and you’ll also see that the component is in
an error state, as indicated by the red circle containing a white x on the Script component,
shown in Figure 19-4.

Figure 19-4 Script component in an error condition

The error state is correct because a transformation Script component needs to have one
or more inputs and one or more outputs, and you haven’t added either of these yet in this
example. So add at least one data flow source and connect it to the Script component. For
this example, use the sample database (from CodePlex) AdventureWorks 2008. After you’ve
done this, open the Script Transformation Editor dialog box by double-clicking on the Script
Component surface, as shown in Figure 19-5.

In the Custom Properties section, you can define the language you want to use for script-
ing and the variables you’ll use in the script. This is the same process you followed earlier for
the Script task in the control flow. Before starting to edit the script’s code, you also need to
define which of the columns that come from the source will be used by your script and in
which mode, ReadWrite or ReadOnly. You do this by configuring the Input Columns page
of this dialog box. We’ve selected a couple of columns to use for this example, as shown in
Figure 19-6.

Figure 19-5 Script Transformation Editor dialog box

Figure 19-6 Input Columns page

Because the transformation Script component can handle more than one input, you can
configure the columns you need to use for each input flow using the Input Name combo box
near the top of the window. You simply switch from one input to another. Only the columns
you select will be available to be processed in the script. Note also that you can configure the
Usage Type for each selected column. You can choose either ReadOnly or ReadWrite.

On the Inputs And Outputs page of this dialog box, you can view detailed information about
the configured input and output columns. Here, you are not limited to working on exist-
ing columns; you can also add new columns to the output. New columns are often useful
because when you do some processing in the script, you might need a new place to store the
processed data; in these cases, a new output column works perfectly.

Output columns have to be configured before you start to write a script that references them.
To switch to the Inputs And Outputs page, click Inputs And Outputs in the left column, as
shown in Figure 19-7.

Figure 19-7 Inputs And Outputs page

Selecting an output and clicking Add Column creates a new output column. You can also
have multiple outputs, but for this example, you just need one output.

Each new output has a default name of Output <n>, where n is a number that will start from
zero and be incremented for each new output. A best practice is to define a customized,
meaningful name for each newly added output; you’ll do that in this example. To add
a new output, click the Add Output button in the Script Transformation Editor dialog box.
The new output, named FullName Output in this example, implicitly includes all the columns
from the input. If you need to add more output columns, click the Add Column button and
then configure the name and data type of the newly created output column. This is shown in
Figure 19-8.

Figure 19-8 Adding a new column to the output

Now that everything has been configured, click the Edit Script button on the Script page of
the Script Transformation Editor dialog box so that you can begin to write the script logic you
need. After you do this, VSTA loads.

Now take a closer look at the autogenerated code. This code includes three methods:

■■ PreExecute
■■ PostExecute
■■ Input0_ProcessInputRow

If you’ve changed the name of the input flow from the default value of Input 0, the name of
the latter method will be different and will reflect, for the part before the underscore charac-
ter, the revised input flow name.

As we continue our discussion, we’ll examine in greater detail how to use each of these meth-
ods (which also depends on the usage of the Script component type that you’ve selected—
that is, Source, Destination, or Transformation). However, we’ll leave the discussion for now,
because before starting to write the script you need to understand how it’s possible to inter-
act with the package and the data flow metadata.

In the Script component, the Dts object (that we used previously in the Script task) is not
available. To access the package variables configured to be used in the script here, you use
the Variables object. For this example, if you add two package variables, named RowCount
and BaseValue, to the ReadOnlyVariables property, you can use them through the Variables
object.

In the example, as the name suggests, RowCount contains the number of processed rows and
the BaseValue variable allows your script to start to count that number not only from one but
from any arbitrary number you want to use. This is particularly helpful when you have a data
flow inside a Loop container and you have to count the total number of processed rows, by
all data flow executions.

It is important to consider that variables added to the ReadWriteVariables property can be
used only in the PostExecute method. This restriction is enforced to avoid the performance
impact of the locking mechanism that SSIS needs to use to prevent possible conflicts if two
different data flows were to attempt to modify the same package variable value. Of course,
local variables (declared inside the script) can be used in any section of the script, and they’re
subject to the normal .NET variable scoping rules. So the trick here is to create a local variable
named _rowCount and use it to keep track of how many rows have been processed by our
script. You do this by assigning this local value to the package variable just before the end of
the Data Flow task, in the PostExecute method.
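
The following minimal sketch shows only the write-back step, assuming RowCount is an Int32
package variable listed in the ReadWriteVariables property and that _rowCount is the local
variable just described, updated row by row elsewhere in the script:

private int _rowCount;

public override void PostExecute()
{
    base.PostExecute();

    // Assigning to a ReadWriteVariables member is allowed only in PostExecute.
    Variables.RowCount = _rowCount;
}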

Inside the Script component’s code, you’ll sometimes also need to access data flow meta-
data—in particular, columns related to the flows that come into the component. For this
situation, the Input0_ProcessInputRow method provides a parameter named Row. This object
exposes all selected columns as properties, and all you have to do is just use them according
to the usage type you’ve defined for the various columns (ReadOnly or ReadWrite). The
following code example shows how the Row object can be used in the
Input0_ProcessInputRow method’s body:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    _rowCount += 1;

    Row.FullName = Row.FirstName + " " + Row.MiddleName + ". " + Row.LastName;
}

The preceding example is obviously oversimplified. You should exhaust all built-in com-
ponents before you write manual scripts. This is because of the time required to write and
debug, in addition to the overhead introduced at package execution time. If you just need a
new column built from simple input concatenation, you should use a built-in transformation
(such as the Derived Column transformation) rather than writing a script. We emphasize this
point because, particularly for users with .NET development backgrounds, the tendency will
be to favor writing scripts rather than using the built-in components. We challenge you to
rethink this approach—we aim to help you be more productive by using SSIS as it is designed
to be used. We have observed overscripting in many production environments and hope to
help you learn the appropriate use of the script tools in SSIS in this chapter. To that end, let’s
drill a bit deeper into working with scripts.

An example of more complex logic is the requirement to have more than one output. We’ll
show the script needed to separate input data into two or more output paths. As we mentioned,
adding new outputs is just a matter of a few clicks—a more important question is how to
direct a particular row to a particular output.

The first thing to do after adding an output is to configure it so that it is bound to the
input data. This operation is done by selecting the new output on the Inputs And Outputs
page of the Script Transformation Editor, and then selecting the appropriate input in the
SynchronousInputID property. This is shown in Figure 19-9. For the example, an output
named UserHashCode Output should be created, and bound to Input 0. One new column
named HashCode should be added to the new output.

Figure 19-9 Associating new output to existing input

In this way, all output contains all the input columns plus any columns added to the specific
output. Thus, all the input data will flow through all the outputs. However, the data for the
column created by the script will be available only for the output flow that contains that col-
umn. You should also note that there’s a difference between a synchronous output and an
asynchronous one—setting the SynchronousInputID property makes an output synchronous;
choosing None makes it asynchronous.

So, basically, if a thousand rows pass through the data flow during the script’s execution, each
output will have a thousand rows, effectively duplicating all the input values. You can change this
default behavior by correctly defining the ExclusionGroup property. All synchronous outputs
that belong to the same group need the same ExclusionGroup value. For the example, both
outputs need the same ExclusionGroup value set, not just the new one.

This property is found on the same Inputs And Outputs page of the Script Transformation
Editor dialog box, and it has a default value of 0 (zero). After you change the property value
to any non-zero number, only the rows that you explicitly direct to that output will flow
there. Figure 19-10 shows this property.

Figure 19-10 Setting the ExclusionGroup property to a non-zero value redirects output.

The logic that decides where to direct the incoming row’s data lies in the script code.
Here, the Row object exposes some new methods that have their names prefixed with a
DirectRowTo string. You compose the method’s full name by using this prefix and then add-
ing the name of the output flow that will be managed by that method. These methods allow
you to decide which of the available outputs the input row should go to. You’re not limited to
sending a row to only one output; you can decide to duplicate the data again by sending the
row to multiple output flows.

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    _rowCount += Variables.BaseValue;

    string fullName = Row.FirstName + " " + Row.MiddleName + ". " + Row.LastName;

    Row.DirectRowToFullNameOutput();
    Row.FullName = fullName;

    if (Row.EmailPromotion != 0)
    {
        Row.DirectRowToUserHashCodeOutput();
        Row.HashCode = (fullName + Row.EmailAddress).GetHashCode();
    }
}

In the example shown, all rows will be directed to the FullName Output, and rows whose
EmailPromotion column value is not zero will also be directed to the UserHashCode Output.

The ComponentMetaData Property


Within the script code, you can choose to write some log information or fire an event, as you
did in the Script task. Keep in mind that in the Script component there is no Dts object avail-
able. Rather, for these purposes, you use the ComponentMetaData property. This property
allows you to access log and event methods so that you can generate logging data (using the
PostLogMessage method) and fire events (using the FireInformation, FireError, or other Fire . . .
methods).
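
As a hedged example, the following fragment shows how a Script component method might
raise events through ComponentMetaData; the message text is purely illustrative:

bool fireAgain = true;
ComponentMetaData.FireInformation(0, ComponentMetaData.Name,
    "Custom transformation is starting.", string.Empty, 0, ref fireAgain);

ComponentMetaData.FireWarning(0, ComponentMetaData.Name,
    "An illustrative warning message.", string.Empty, 0);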

The Connections property allows you access to the connection managers defined in the
package. To use a connection manager in the Script component, you first have to declare its
usage in the Script Transformation Editor dialog box (found on the Connection Managers
page). If you need to access connection manager information by using the Connections prop-
erty in your script, you must first define the connection managers on this page, as shown in
Figure 19-11.

Figure 19-11 Connection Managers page of the Script Transformation Editor



Using a connection manager is a convenient way to make your script easily configurable with
custom connection information. As we mentioned, this type of dynamic connection configu-
ration information is a preferred package design method. In this example, you have a script
that needs to read from or write to a file. You’ll define the file name in a package variable and
then associate that variable with a configuration file using the configuration feature of SSIS.
Follow best practice by creating a connection manager that uses an expression to configure
its ConnectionString property to use the file name stored in the variable. In this way, you’ll
have a solution that is understood by the SSIS architecture, is reusable by other tasks or com-
ponents, and is easily configurable. We discussed the procedure for configuring connection
managers in Chapter 18, “Deploying and Managing Solutions in Microsoft SQL Server 2008
Integration Services.”

So, if in your script you need to access any resources external to the package (such as a
text file), you should do that by using a connection manager, thereby avoiding accessing
the resource directly. After you configure the connection manager, you use it to get the
information on how you can connect to the external resource. For a text file, for example,
this information will simply be the file path. To get that information, you override the meth-
ods AcquireConnections and ReleaseConnections to manage the connections to external
resources. In the aforementioned example, access to a file name and path through a defined
file connection manager can be set up as shown in the following snippet from the
AcquireConnections method:

public override void AcquireConnections(object Transaction)
{
    // The MailingAddresses file connection manager returns the file path as a string.
    _filePath = (string)this.Connections.MailingAddresses.AcquireConnection(Transaction);
}

The connection manager’s AcquireConnection method returns the object you need to use to
connect with the external resource. For a file, it is a string containing the path to the file; for a
SQL Server connection (using a .NET provider), it is a SqlConnection object; and so on. Be aware
that AcquireConnection returns a generic Object reference, so it’s your responsibility to cast that
generic Object to the specific object you need to use to establish the connection with the
external resource.

The Transaction object is a reference to the transaction in which the component is running.
Just like all the other methods that you override, the AcquireConnections method is called
automatically by the SSIS engine at runtime and during the package validation. When the
engine calls that method, the Transaction parameter allows you to know whether the com-
ponent is working within a transaction or not. If the data flow that contains that component
is taking part in a transaction (for example, it has its TransactionOption property set to Required,
or any of the containers that contain the data flow has that property set to Required), the
Transaction object passed to the AcquireConnections method will not be null. The Transaction
object can then be passed to the connection manager you’re using so that the connection
manager has the information it needs to take part in that transaction, too. To do that, pass
the Transaction object to the AcquireConnections method of the connection manager you’re
going to use in your component to allow that component to be transactional.

Source, Transformation, and Destination


Now that we’ve covered the mechanics of the Script component, it’s time to dive a little
deeper into the differences between the three possible behaviors you can associate with this
component. We’ll start first by taking a closer look at using it as a (data) source.

Source
A source Script component can have one or many outputs but no inputs. As you
start to edit the code, you’ll see that the method ProcessInputRow is not present. Rather,
another method exists, named CreateNewOutputRows. Here is where you can write the code
that generates output rows.

Inside that method, if you need to generate new output rows by using a script, you can
do so by invoking the method AddRow on the output buffer object. An output buffer
object will be created for any output flow defined, and its name will be something like
<NameOfTheOutput>Buffer.

The following code fragment shows how it’s possible to add rows to an output flow named
Authors:

else if (line.ToLower().StartsWith("by"))
{
    // Authors
    authorLine = line.Substring(2).Trim();

    string[] authors = authorLine.Split(',');

    foreach (string a in authors)
    {
        AuthorsBuffer.AddRow();
        AuthorsBuffer.Name = a.Trim();
    }
}
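
To put that fragment in context, here is a minimal sketch of a complete CreateNewOutputRows
method that reads a text file and feeds the Authors output shown above. The file path, the line
format, and the single Name output column are assumptions made only for illustration:

public override void CreateNewOutputRows()
{
    foreach (string line in System.IO.File.ReadAllLines(@"C:\ImportFiles\books.txt"))
    {
        if (line.ToLower().StartsWith("by"))
        {
            // Each comma-separated author becomes one output row.
            foreach (string a in line.Substring(2).Trim().Split(','))
            {
                AuthorsBuffer.AddRow();
                AuthorsBuffer.Name = a.Trim();
            }
        }
    }
}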

Synchronous and Asynchronous Transformation


Although you’ve been introduced to the functionality of the Script component, we need to
show you another important aspect of its functionality. In all the examples you’ve seen up
until now, all the rows that flowed into this component were processed as soon as they came
up and were then immediately sent to one of the available output flows. This kind of behav-
ior makes this component a synchronous transformation because, as the name suggests, the
output is synched with the input—that is, it is processed one row at time. To be more precise,
the data doesn’t come into the component one row at time; rather, it’s grouped into buffers.
Anyway, the buffers—at this level—are transparent to the user, so we’ll simplify things by say-
ing that you deal with data one row a time.

Although synchronous processing is adequate for some business situations, in others you
might want to use a more flexible and scalable method. For example, if you want to do some
aggregation processing on a large incoming dataset, you might not want to wait to receive
all the incoming rows before starting to produce any output. In this case, you might prefer to
configure your component to use asynchronous processing, otherwise known as asynchro­
nous transformation. Of course, aggregation is not the only case in which you might choose
to use an asynchronous transformation. Some other examples of transformation types that
use asynchronous processing are the native Merge and Sort transformations.

To define whether a component will be synchronous or not, you use the same Inputs And
Outputs page of the Script Transformation Editor dialog box, shown in Figure 19-12. Here, for
each defined output, we can find the property SynchronousInputID.

Figure 19-12 The SynchronousInputID property defines asynchronous behavior.

To define an asynchronous output, all you need to do is set the value for SynchronousInputID
to None. This also indicates that, for the specified output, the only available output columns
will be the ones that you add. The input columns won’t be automatically available like they
are for synchronous transformations.

Tip Because the SynchronousInputID property is configured on an output basis, a component
with multiple outputs can have both behaviors.

Keep in mind that when you add a transformation Script component to your package, it auto-
matically includes one output, which is configured to be synchronous with the input by
default. If you add more outputs, all of the manually added outputs will automatically be set
to be asynchronous. Apart from these differences, the methods you use in your script for
synchronous and asynchronous transformations are the same, with some additional methods
typically being used for the latter (for example, ProcessInput).

To create a synchronous script, you use—as we’ve said before—the following methods, which
will be made available automatically by autogenerated code:

■■ PreExecute
■■ PostExecute
■■ Input0_ProcessInputRow

For asynchronous output, you’ll also find that the autogenerated code contains the method
CreateNewOutputRows.

The CreateNewOutputRows method allows you to create output rows, and it works the same
way here as it does for a source Script component. However, in this case, this method is not as
useful on its own, because you probably need to manipulate data that is coming from an input
flow to generate the output data, and the CreateNewOutputRows method gets called by the
SSIS engine only one time, before the ProcessInputRow calls, not after.

This is a typical scenario when you’re creating a script transformation that aggregates data—
and thus for a certain number of input rows you have to produce a certain number of output
rows. However, the number of output rows is completely different and independent from the
input, and also the output rows have a completely different structure than the input rows.

As an example, consider this problem: as input rows, you have some string values; as output,
you have to generate a row for any used alphabetical letter, and for each of them you have to
tell how many times it has been used—not only in one single row but in the whole data flow.

To illustrate, suppose you have three input rows, each with one column and containing the
values shown in Table 19-1.

Table 19-1 Sample Scenario for Aggregating Data


Row Number Value
1 abc
2 ab
3 a

The resulting flow of aggregated values is shown in Table 19-2.

Table 19-2 Results of Sample Scenario


Letter UsageCount
A 3
B 2
C 1

Fortunately, the methods exposed by the autogenerated code are only a subset of the
available methods, and there is a method that allows you to manipulate input and output
data in an arbitrary manner—the ProcessInput method. Normally, it is invoked by the SSIS
engine each time a new data buffer from the input flow needs to be processed. Internally, it
calls the ProcessInputRow method for each row present in the buffer. So all you need to do is
override that method and write your own implementation, which lets you decide what to do
with the data available in the buffer.

By overriding the ProcessInput method, you’re handling data as it comes into a buffer. You
might also want to define what should happen after the data is flushed out of each buf-
fer or after the entire flow has been processed. You do this by accessing the Buffer object’s
EndOfRowset method, which can tell you if there are still rows to be processed in the incom-
ing data.

Overriding ProcessInput allows you to create a very specific transformation, without any limi-
tations in terms of functionality that you might want to implement. All this power, as usual,
comes at the price of complexity, so it’s important to have a clear understanding of the dif-
ference between ProcessInput and ProcessInputRow. The first method allows you to deal with
entire buffers of data that will be processed, while the latter method gives you the simplicity
of working row by row when writing the code that will transform the data.

As you can see, the data flow process can be complex. At this point, we’ll take a look at all of
the methods you can use when developing a data flow script, as well as when each of them is
invoked by the SSIS engine during data flow processing.

The AcquireConnections method is called as you begin to execute your package. As soon as
the data flow starts to process your script, the first method associated with script process-
ing that gets executed is the PreExecute method. This method is executed only one time
per package execution. Then the CreateNewOutputRows method is run. This method is also
executed only one time per package execution. After that, the <input_name>_ProcessInput
method is executed, one time for each available buffer. This method internally calls the
ProcessInputRow method that is executed for each row available. Finally, the PostExecute
method gets called (one time for each execution). After the script process completes, the
ReleaseConnections method is called.

So if you have a data flow that has split the incoming data into two buffers that hold three
rows each, here is the sequence in which the previously described methods will be called:

■■ PreExecute
■■ CreateNewOutputRows
■■ Input0_ProcessInput
■■ ProcessInputRow
■■ ProcessInputRow
■■ ProcessInputRow
■■ Input0_ProcessInput
■■ ProcessInputRow
■■ ProcessInputRow
■■ ProcessInputRow
■■ PostExecute

This is only an example. Remember that you can choose only how much memory and the
maximum number of rows per buffer will be used; the number of buffers will be calculated
automatically by the system.

Now let’s suppose that you want to count how many times alphabet letters are used in some
text that comes in as input. In the ProcessInputRow method, you create the logic to extract
and count letter usage. In the ProcessInput method, you make sure that all rows will be pro-
cessed and that at the end, and only at the end, you produce the output with all gathered
data.

public override void Input0_ProcessInput(Input0Buffer Buffer)
{
    // Process all the rows available in the buffer
    while (Buffer.NextRow())
    {
        Input0_ProcessInputRow(Buffer);
    }

    // If no more buffers are available we processed all the data
    // and we can set up the output
    if (Buffer.EndOfRowset())
    {
        foreach (KeyValuePair<char, int> kv in _letterUsage)
        {
            LettersUsageBuffer.AddRow();
            LettersUsageBuffer.Letter = kv.Key.ToString();
            LettersUsageBuffer.Usage = kv.Value;
        }
    }
}
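
The preceding method relies on a few complementary pieces that are not shown above. The
following sketch supplies them, assuming the System.Collections.Generic namespace is
imported, the input column is named Value, and the asynchronous output is named
LettersUsage with Letter and Usage columns:

private Dictionary<char, int> _letterUsage = new Dictionary<char, int>();

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Count each alphabetical character across the whole data flow.
    foreach (char c in Row.Value.ToLower())
    {
        if (!char.IsLetter(c))
            continue;

        if (_letterUsage.ContainsKey(c))
            _letterUsage[c] += 1;
        else
            _letterUsage[c] = 1;
    }
}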

Asynchronous outputs give you a lot of flexibility, but this comes at a price. Depending on
your business requirements and on the code you write, the asynchronous output might have
to process a large amount of input data before it can output even a single row. During all
that time, all the components that are connected to that output flow won’t be able to receive
any data. This is known as blocking, and it can significantly degrade the performance of your
package. So you need to base such a design on business requirements and test it using
production levels of data during the development phase of your project.

Destination
When the Script component is configured to work as a destination, it will have only one
input and no output unless you configure an error output. Basically, the way in which you
use a destination Script component is similar to that of a transformation, except that you
aren’t required to produce any output rows. You put all the code you need to write in the
ProcessInputRow method by using the now familiar Row object to access the data from
the rows.

You can also use the PreExecute and PostExecute methods. Through them you can, for
example, open and correctly configure the physical location at which you would like to save
your data before processing begins, and then close it down cleanly after you have finished
your processing.

The scenario we just described is quite common because typically a destination Script
component will be used to save data in one or more custom file formats. You can open
a stream in the PreExecute method and write the file header if you need it. Then, in the
ProcessInputRow method, you can output data to the opened stream, and finally, you can
close and dispose of the used stream in the PostExecute call.
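
The following minimal sketch illustrates that pattern for a destination Script component; the
file path, the header text, and the FirstName and LastName input columns are assumptions
used only for this example:

private System.IO.StreamWriter _writer;

public override void PreExecute()
{
    base.PreExecute();
    _writer = new System.IO.StreamWriter(@"C:\Export\Customers.txt");
    _writer.WriteLine("FirstName|LastName");   // custom header line
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    _writer.WriteLine(Row.FirstName + "|" + Row.LastName);
}

public override void PostExecute()
{
    // Close the stream after all rows have been written.
    _writer.Close();
    base.PostExecute();
}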

Debugging Script Components


Unfortunately, debugging Script components by setting breakpoints is not possible because
this technique is not supported in SSIS. This also means that stepping into the Script
component while it’s running and checking runtime values through watch windows is
not possible.

Tip If you want to debug, you have to use old-fashioned debugging tricks, such as printing
messages to the screen. At least you can use the logging and event-firing capabilities of SSIS. If
you really need it, you can also use the MessageBox.Show method in the System.Windows.Forms
namespace to show a pop-up message that will also block the script execution until you click the
OK button. However, this should be used only in development and removed before the pack-
age is put into production because it can cause issues when automating package execution. In
general, using the ComponentMetaData.FireInformation method is the best approach to adding
debugging information to your Script component.
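
As a concrete illustration of this tip, the following hedged snippet (which assumes the script
project references System.Windows.Forms and uses the _rowCount field from the earlier
transformation example) shows the pop-up approach:

// For quick-and-dirty debugging only; remove this call before the package goes
// into production, because it blocks execution until you click OK.
System.Windows.Forms.MessageBox.Show("Rows processed so far: " + _rowCount);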

Overview of Custom SSIS Task and Component Development

Although SSIS scripting is quick and easy, it’s not the best solution for implementing business
logic for all BI solutions. As with any type of scripting, SSIS package-level scripting is isolated
to the particular package where the script was developed and can be reused in another pack-
age only by cutting and pasting the original script. If your business requirements include
complex custom logic that will be reused frequently across multiple SSIS
packages, you’ll want to create your own SSIS object rather than simply reuse the SSIS Script
tasks over and over.

A sample of such a custom object is a task that compresses one or more files (typically, logs)
and then moves them into an archive folder. This is a common operation that can be encap-
sulated in a custom task and used in any package your company develops in which such a
feature is needed. Authoring an SSIS component requires that you be proficient using a .NET
language. You can, of course, use any .NET language—that is, C# or Visual Basic .NET. You
also need to have a good knowledge of SSIS’s object model.

Note As you can imagine, creating a custom object is not a trivial job, and the topic could fill an
entire book by itself. For this reason, we’ll describe only the main topics here and the ones that
you need to pay the most attention to. We’ll also show code fragments that give you an idea of
what should be in a custom object. The best way to read this section of the chapter is with the
related example at hand, where you can find the entire code, run and test it, and correlate it with
the problems being discussed.

The great power of a custom SSIS object is that once you have finished its development, you
can install it in the Toolbox in BIDS and reuse it across as many SSIS packages as need
be. Also, you can share this object with other developers working on your project. We believe
there is also potential for reusable objects to be sold commercially, even though we haven’t
seen much activity in this development space yet.

Before you create a custom object you might want to do a quick search to see if your par-
ticular business problem has been solved via the creation of a commercial custom object by
another vendor. We favor buying over building because of the complexity of appropriately
coding custom SSIS objects.

Thanks to the pervasive use of .NET, SSIS is such an extensible product that you might think
of it as a framework or development platform for creating ETL solutions. Another way to
understand this is to keep in mind that you can extend nearly every object in SSIS, which pro-
vides you with many opportunities for customizing all aspects of SSIS. These customizations
can include the following:

■■ Custom tasks Create custom tasks to implement some specific business logic.
■■ Custom connection managers Connect to natively unsupported external data
sources.
■■ Custom log providers Log package events defining custom formats.
■■ Custom enumerators Support iteration over a custom set of objects or values.
■■ Custom data flow components Create custom data flow components that can be
configured as sources, transformations, or destinations.

To develop custom objects, you can use anything that allows you to create .NET (that is, C# or
Visual Basic .NET) class libraries. Normal practice is to use a full version of Visual Studio 2008
to create .NET class libraries. Each SSIS custom object needs to provide the logic that will be
available at runtime—when the package is running—and at design time—when the package
that uses your component is being developed.

For this latter purpose, all objects should also provide a user interface to interact with, where
the package developer can configure the object’s properties. Providing a user interface is not
mandatory, but it is strongly advised because an easily configurable object helps you to avoid
bugs and lower package development costs.

Before diving into the specific code for a custom object, you need to know a few things
about deploying it. After you’ve written all of the code for your SSIS object and are ready to
compile it into an assembly, you need to follow a couple of steps to be able to integrate the
compiled assembly with the SSIS engine and BIDS. These steps are generic steps that you
need to follow for any custom component you develop, so keep them in mind. They are basic
knowledge that you need to have to be able to deploy and distribute your work.

The first step you need to complete is to sign the assembly. This means that you have to cre-
ate a public/private key pair.

In Visual Studio 2008, go to the project’s properties window and click Signing in the left pane
to access the Signing page. This is shown in Figure 19-13.

Figure 19-13 The Signing page of the SSIS project assembly

You can then choose to use an existing key file or create a new one. After you’ve completed
this step, you can proceed with building the solution as you would with any other .NET
assembly. Click Build on the shortcut menu in Solution Explorer for your SSIS component
project. This is shown in Figure 19-14.

Figure 19-14 The Build option compiles your project into an assembly.

After your assembly has been successfully built—that is, it has no design-time errors—you
can deploy the assembly so that it can be used by BIDS. The assembly file has a .dll extension;
you need to register it in the global assembly cache (GAC). There are several methods for
performing a registration.

For this example, you’ll use the gacutil.exe tool, which can be found in the directory where
the .NET Framework Software Development Kit (SDK) or Microsoft Windows SDK has been
installed. On our sample server, which has SQL Server 2008 and Visual Studio 2008 installed,
we find gacutil.exe in the C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin\ folder.

To register an assembly in the GAC, you just have to execute the following line from the
Visual Studio command prompt (which you can find under Visual Studio Tools on the
Windows Start menu):

gacutil.exe -iF <assembly_file.dll>

After you register the assembly in the GAC, you have to deploy it to a specific folder inside
the SQL Server 2008 installation directory. Here, in the DTS folder (by default, located at
C:\Program Files\Microsoft SQL Server\100\DTS), there are specific directories where—
depending on the type of custom object you’re deploying—you have to copy your assembly.
Table 19-3 shows a list of object types and the corresponding directories to which the assem-
blies should be copied.

Table 19-3 SSIS Custom Objects and Target Directories


Custom Object Type Target Directory
Task Tasks
Connection manager Connections
Log provider LogProviders
Data flow component PipelineComponents
Foreach enumerator ForEachEnumerators

Using Visual Studio 2008, you can make this process automatic by using the post-build
events. All you have to do is to add the following code:
"C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin\gacutil.exe" -iF "$(TargetPath)"
copy "$(TargetPath)" "C:\Program Files\Microsoft SQL Server\100\DTS\<dir>\$(TargetFileName)"

In place of <dir>, specify the appropriate directory as indicated in Table 19-3.

After you successfully complete all of the steps just described, your component will be nearly
ready to be used from within the BIDS Toolbox. However, for custom task objects or data
flow components, you have to perform an additional step to be able to use either type. You
will want to make your object available in the appropriate Toolbox window.

To add an object to the Toolbox, right-click on the Toolbox and then click Choose Item. The
Choose Toolbox Items dialog box opens, where you can choose an object to add. Note that
in our example, shown in Figure 19-15, SSIS data flow items and SSIS control flow items are
shown on two separate tabs.

Figure 19-15 The Choose Toolbox Items dialog box includes tabs for data flow components and control flow
components.

After selecting your component, you can finally start to use it in your SSIS packages.

We’ll look next at an example of a business problem that relates to BI and how you might
solve it by writing a custom-coded task. The business problem is the need to compress data.
Because of the massive volumes of data that clients often need to work with during their BI
projects, we’ve found data compression via SSIS customization to be quite useful.

Control Flow Tasks


To build a custom task that can be used in the control flow, you use the base class Task.
This class can be found in the namespace Microsoft.SqlServer.Dts.Runtime in the assembly
Microsoft.SqlServer.ManagedDTS. This is shown in the Add Reference dialog box in Figure 19-16.

Figure 19-16 Add Reference dialog box

If you want to supply a user interface for your task (and you normally will), you also have
to add a reference to the assembly Microsoft.SqlServer.Dts.Design. Before starting to write
the code of your task, you have to be sure that it can be correctly integrated with the SSIS
environment. For that reason, you have to decorate the class that derives from Task with
DtsTaskAttribute, which specifies design-time information such as task name, task user inter-
face, and so on:

[DtsTask(
    DisplayName = "CompressFile",
    IconResource = "DM.SSIS.ControlFlow.Tasks.Compress.ico",
    UITypeName = "DM.SSIS.ControlFlow.Tasks.CompressFileUI," +
                 "DM.SSIS.ControlFlow.Tasks.CompressFile," +
                 "Version=1.0.0.0," +
                 "Culture=Neutral," +
                 "PublicKeyToken=c0d3c622a17dee92"
)]
public class CompressFileTask : Task

Next you implement the methods Validate and Execute. These two methods will be called by
the SSIS engine. The Validate method is called when the engine starts the validation phase.
Here, you can check whether the objects on which your task will work are ready to be used.
The Execute method is called in the execution phase, as the name suggests. In this phase, you
should write the code that performs the actions you desire. For this example, because you’re
developing a task that compresses files, the Validate method will check that a target file has
been specified while the Execute method will do the compression.
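
As a hedged sketch of what those two methods might contain for the compression scenario,
consider the following fragment. The TargetFile property, the GZip-based compression, and
the .gz output name are illustrative choices, not the only possible implementation:

public string TargetFile { get; set; }

public override DTSExecResult Validate(Connections connections,
    VariableDispenser variableDispenser, IDTSComponentEvents componentEvents,
    IDTSLogging log)
{
    // Fail validation if no usable target file has been specified.
    if (string.IsNullOrEmpty(TargetFile) || !System.IO.File.Exists(TargetFile))
    {
        componentEvents.FireError(0, "CompressFileTask",
            "A valid target file must be specified.", string.Empty, 0);
        return DTSExecResult.Failure;
    }
    return DTSExecResult.Success;
}

public override DTSExecResult Execute(Connections connections,
    VariableDispenser variableDispenser, IDTSComponentEvents componentEvents,
    IDTSLogging log, object transaction)
{
    // Compress the target file into a .gz archive next to the original.
    using (System.IO.FileStream source = System.IO.File.OpenRead(TargetFile))
    using (System.IO.FileStream target = System.IO.File.Create(TargetFile + ".gz"))
    using (System.IO.Compression.GZipStream gzip =
        new System.IO.Compression.GZipStream(target,
            System.IO.Compression.CompressionMode.Compress))
    {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            gzip.Write(buffer, 0, read);
        }
    }
    return DTSExecResult.Success;
}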

In our example, we’ve chosen also to implement a user interface to make the task more user
friendly. For a custom object, you must implement the user interface through the use of a
standard Windows Form. This user interface will be shown to whoever uses the custom task
in their own packages. During package development, the interface needs to be created in
such a way that BIDS can open it internally and make it communicate with the SSIS develop-
ment infrastructure.

To do this, create a separate class that implements the interface IDtsTaskUI. The interface has
four methods, and the one you use to create the user interface is called GetView. Create the
Windows Form that you defined as your user interface, and return it to the caller (which is
BIDS):

public System.Windows.Forms.ContainerControl GetView()
{
    return new CompressFileForm(_taskHost);
}

The _taskHost variable holds a reference to the task control that will use this GUI—in this
case, the CompressFileTask created earlier. It allows the GUI to access and manipulate the
properties made available by the task.

You might be wondering at this point how you bind the interface you’ve developed so far
to the class that has been derived from Task and contains the run-time core of your custom
component. This is actually not difficult; just specify the fully qualified name of the user-
interface class. This name is a comma-separated string of the following values: type name,
assembly name, file version, culture, and public key token. Then place this information into
the DTSTask attribute’s UITypeName property:

UITypeName = "DM.SSIS.ControlFlow.Tasks.CompressFileUI," +
"DM.SSIS.ControlFlow.Tasks.CompressFile," +
"Version=1.0.0.0," +
"Culture=Neutral," +
"PublicKeyToken=c0d3c622a17dee92"

You can obtain the PublicKeyToken through the GAC or by using the sn.exe tool distributed
with the .NET SDK. Using the -T parameter (which displays the public key token for an assem-
bly) and specifying the signed assembly from which you want to read the public key token
accomplishes this task:

sn.exe -T <assembly_file>

Data Flow Components


Developing custom data flow components is by far the most complex thing you can do while
developing any type of SSIS custom objects. We prefer to buy rather than build, particularly
in this area, because of the complexity of custom development. Other than for commercial
use, we haven’t deployed any custom data flow components into any BI solution that we’ve
implemented to date.

The complexity of development occurs in part because SSIS uses a large amount of metadata
information to design and run data flows. Thus, your component also needs to deal with all
that metadata, providing and consuming it.

For this reason, development of a custom data flow component is not described here. The
topic would require a chapter of its own and is beyond the scope of this book. SQL Server
Books Online includes a tutorial on this topic and a sample. So, if custom development of
data flow components is part of your project, we recommend that you start with SQL Server
Books Online.

Other Components
In addition to creating custom control flow tasks and data flow components, you can also
create custom connection managers, custom log providers, and custom foreach enumerators.
This last option is one of the most interesting because it allows you to extend the supported
sets over which the Foreach Loop container can iterate.

To create a custom enumerator, you just create a class that derives from the ForEachEnumerator
base class. As with other types of custom object development, if you want to supply
a user interface you have to decorate that class with the DtsForEachEnumerator attribute,
specifying at least the DisplayName and UITypeName properties.

When creating a foreach enumerator, the most important methods to implement are Validate
and GetEnumerator. Validate is the usual method for checking that all the objects on which
your custom object will work are defined correctly and ready to be used. GetEnumerator is
the method that provides the set of values over which the Foreach Loop will iterate. The out-
put of that method must be an object that supports enumeration, such as List or ArrayList.
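
As an example, the following minimal sketch outlines an enumerator that iterates over the
dates between two boundary values. The DateFrom and DateTo properties are hypothetical,
and the sketch assumes the System and System.Collections.Generic namespaces are imported:

public DateTime DateFrom { get; set; }
public DateTime DateTo { get; set; }

public override object GetEnumerator()
{
    // Build the collection the Foreach Loop will iterate over.
    List<DateTime> dates = new List<DateTime>();
    for (DateTime d = DateFrom; d <= DateTo; d = d.AddDays(1))
    {
        dates.Add(d);
    }
    return dates;
}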

The user interface in this case is not created using a Windows Form. You have to inherit from
ForEachEnumeratorUI, which in turn derives from System.Windows.Forms.UserControl. This is
needed because the control that you create is displayed in the Enumerator configuration area
of the Collection page of the Foreach Loop Editor dialog box.

For example, suppose that you’re developing a foreach enumerator that allows package
developers to iterate over two dates. In Visual Studio, your user interface will look like the
one shown in Figure 19-17.

Figure 19-17 Sample custom enumerator interface

To use your new object, you first have to deploy the project in the usual way, with the only
difference being that now you don’t have to add anything to the Toolbox. Rather, you just
drop a ForEach Loop on your package and edit its properties. You’ll find your newly created
extension in the Foreach Loop Editor dialog box, as shown in Figure 19-18.

Figure 19-18 Custom enumerator in BIDS

You can now start to use it in the same way as you use other built-in enumerators.

Creating custom SSIS objects is not an extremely complex task, yet it is not trivial either. If
you want to learn more about custom component development, a good place to start is SQL
Server Books Online. Under the Integration Services Development topic, there are a lot of
samples, a tutorial, and of course, a good explanation of everything you need to begin your
development process.

Overview of SSIS Integration in Custom Applications


So far, you’ve learned how you can extend SSIS through custom object development. But
sometimes you might want to extend your custom application by integrating SSIS functional-
ity directly into it, rather than extending SSIS itself.

In this section, we’ll answer the following questions:

■■ How can you integrate SSIS package functionality into custom applications so that you
can execute packages from inside of those applications?
■■ How do you enable your application to consume the data that SSIS produces as a result
of a data flow task?

Before we explain how you can do these powerful things, we should emphasize two vital
concepts:

■■ A package runs on the same computer as the program that launches it. Even when a
program loads a package that is stored remotely on another server, the package runs
on the local computer.
■■ You can run an SSIS package only on a computer that has SSIS installed on it. Also, be
aware that SQL Server Integration Services is now a server component and is not redis-
tributable to client computers in the same manner that the components required for
SQL Server 2000 Data Transformation Services (DTS) were redistributable. So, even if
you want to install only Integration Services on a client computer, you need a full server
license!

To start to load and execute a package in your application, add a reference to the
Microsoft.SqlServer.ManagedDTS assembly. In this assembly, you’ll find the namespace
Microsoft.SqlServer.Dts.Runtime, which contains all the classes you need.

The first class you have to use is the Application class. All you have to do is load a package
with the appropriate Load method, depending on where the package is stored, and then you
can run it by using the Execute method:

Microsoft.SqlServer.Dts.Runtime.Application app =
new Microsoft.SqlServer.Dts.Runtime.Application();

Package pkg = app.LoadPackage(@"Sample Package 10 - Custom Enum.dtsx", null);


pkg.Execute();

If you also need to intercept and consume the events fired from packages, you have to
create a custom event-handling class that derives from the base class DefaultEvents.
Then you can override the event handler that you’re interested in. If you need to intercept
Information and Error events, your code will look like the following sample.
public class CustomEvents : DefaultEvents
{
    public override bool OnError(DtsObject source, int errorCode,
        string subComponent, string description, string helpFile,
        int helpContext, string idofInterfaceWithError)
    {
        Console.WriteLine(description);
        return false;
    }

    public override void OnInformation(DtsObject source,
        int informationCode, string subComponent, string description,
        string helpFile, int helpContext, string idofInterfaceWithError,
        ref bool fireAgain)
    {
        Console.WriteLine(description);
    }
}

Next, the Execute method should be changed to pass in a reference to the custom event lis-
tener so that it can route fired events to your handlers:

CustomEvents ce = new CustomEvents();

Package pkg = app.LoadPackage(@"Sample Package 10 - Custom Enum.dtsx", ce);
pkg.Execute(null, null, ce, null, null);

It’s really quite simple to run a package from a custom application. We expect that you’ll find
creative ways to use this powerful and flexible capability of SSIS.

Next we’ll look at how to execute a package that acts as a data source so that you can
consume its results with a DataSet or DataReader class. This gives you the ability to create
applications that can display the results of data flow processing on screen or show on a grid
all the rows that were processed with errors so that end users can immediately correct
them.

To do either of these things, you have to reference the assembly
Microsoft.SqlServer.Dts.DtsClient.dll. It can be found in %ProgramFiles%\Microsoft SQL
Server\100\DTS\Binn. This assembly contains specific implementations of the IDbConnection,
IDbCommand, and IDbDataParameter interfaces for SSIS, which allow you to interact with SSIS
as a standard ADO.NET data source. You can use the standard .NET classes and methods to
access data from the package. Thanks to these classes, you can invoke a package execution
using the DtsConnection class. This class can be used to execute a DtsCommand that provides
an ExecuteReader method, which gives you a .NET DataReader that can be used, for example,
to populate a grid.

In the next example, you start by creating and initializing a DtsConnection class. Just like any
other connection classes, your custom connection needs a connection string. In this case, the
connection string is the same string you’ll use as an argument for the dtexec.exe tool to run
the desired package.

DtsConnection conn = new DtsConnection();

conn.ConnectionString =
    @"-f C:\Work\SQL2008BI\Chapter15\Packages\SQL2008BI\Sample Package 11 - DataReaderDest.dtsx";
conn.Open();

Next, create and initialize the command you want to execute to get the data:

DtsCommand cmd = new DtsCommand(conn);
cmd.CommandText = "DataReaderDest";

Here the word command is used somewhat inappropriately because there are no commands
to run. The command really points to a special destination that a package executed in that
way needs to use. This special destination is the DataReader destination. The CommandText
property needs to point to the name of the DataReader destination in the package that you
are running.

In this example, the data flow in the package looks like the sample shown in Figure 19-19.

Figure 19-19 Data flow for the current example

Because DtsCommand implements IDbCommand, you can use the usual ExecuteReader to
get a DataReader object that can be used to populate a DataGrid. The result is a simple
application that shows on a grid the rows processed by the package that has been sent to
the DataReaderDest destination. An example using a Windows Forms interface is shown in
Figure 19-20.
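
A minimal sketch that completes the example might look like the following, assuming the
System.Data namespace is imported and that resultsGrid is a DataGridView control on the
form (a hypothetical name):

IDataReader reader = cmd.ExecuteReader(CommandBehavior.Default);

// Load the rows produced by the package and bind them to the grid.
DataTable results = new DataTable();
results.Load(reader);
resultsGrid.DataSource = results;

conn.Close();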

Figure 19-20 Data flow output application in a Windows Forms interface

We are excited by the possibilities of integration into many different types of UI environments,
and we see this as a big growth area for SSIS developers. We are not done with this story
quite yet, however.

By using the same Microsoft.SqlServer.ManagedDTS assembly and the
Microsoft.SqlServer.Dts.Runtime namespace, you can manage SSIS packages in a fashion that is
similar to the functionality available in SQL Server Management Studio (SSMS) or by using the
dtutil.exe tool.
For example, you can enumerate all the available packages in a specific storage location (SQL
Server or SSIS Package Store), or you can manage storage locations by creating, removing, or
renaming folders inside them. Importing and exporting packages into a storage location is
also possible.

Any of these actions can be accomplished by using the Application class that exposes explicit
methods for them:

■■ LoadFromSqlServer
■■ RemoveFromSqlServer
■■ CreateFolderOnSqlServer
■■ RemoveFolderOnSqlServer

Of course, the list is much longer. This is just a sample to show you the methods you can use
to load, create, or remove packages and folders in a SQL Server storage location.
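
The following hedged sketch shows a few of these methods in action, assuming a default local
instance accessed with Windows authentication; the folder and package names are illustrative:

Microsoft.SqlServer.Dts.Runtime.Application app =
    new Microsoft.SqlServer.Dts.Runtime.Application();

// List the packages stored in the root folder of the SQL Server store.
PackageInfos packages = app.GetPackageInfos(@"\", "localhost", null, null);
foreach (PackageInfo info in packages)
{
    Console.WriteLine(info.Name);
}

// Load a stored package, and then remove it from the server.
Package pkg = app.LoadFromSqlServer(@"\Sample Package", "localhost", null, null, null);
app.RemoveFromSqlServer(@"\Sample Package", "localhost", null, null);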

Because of the completely exposed object model, you can create any kind of application to
manage and run SSIS packages on your own, which enables you to meet particular business
requirements when creating a custom administrative environment. Two good samples of that
kind of application are available on CodePlex, where they are freely downloadable along with
source code:

■■ DTLoggedExec (http://www.codeplex.com/DTLoggedExec) This is a tool that allows
you to run an SSIS package and produces full and detailed logging information of exe-
cution status and package runtime data.
■■ SSIS Package Manager—PacMan (http://www.codeplex.com/pacman) This is a util-
ity designed to permit batch operations on arbitrary sets of SSIS packages. Users can
select a single package, a Visual Studio project, or solution or file system folder tree and
then validate or update all selected packages in one operation.

Summary
In this chapter, we showed you how to extend SSIS via native and custom script tasks. We also
covered the mechanics of creating custom objects. Finally, we showed you how to embed
SSIS packages in custom applications.

Because SSIS is a generic extract, transform, and load tool, it has many generic features that
are of great help in a variety of cases. However, for some specific business problems, you
might find that there are no ready-made solutions. Here lies the great power of SSIS. Thanks
to its exceptional extensibility, you can programmatically implement the features you need to
have, either by using scripting or creating reusable objects.
Part IV
Microsoft SQL Server Reporting
Services and Other Client
Interfaces for Business
Intelligence

Chapter 20
Creating Reports in SQL Server 2008
Reporting Services
In this chapter, we look at the parts and pieces that make up an installation of Microsoft SQL
Server 2008 Reporting Services (SSRS). We start by reviewing the installation and configura-
tion processes you’ll use to set up Reporting Services. We then explore the report develop-
ment environment in Business Intelligence Development Studio (BIDS). While doing this,
we’ll walk through the steps you’ll use to build, view, and deploy basic reports of various
types. This is the first of three chapters in which we’ll focus on understanding concepts and
implementation details for using SSRS. In the next two chapters, we’ll look more specifically
at using SSRS as a client environment for SQL Server Analysis Services (SSAS) cubes and data
mining models. Then we’ll discuss advanced SSRS concepts, such as custom client creation,
working directly with the SSRS APIs, and more. Also, we’ll tackle integration between SSRS
and Microsoft Office SharePoint Server 2007 in Chapter 25, “SQL Server Business Intelligence
and Microsoft Office SharePoint Server 2007.”

Understanding the Architecture of Reporting Services


SQL Server Reporting Services was introduced in SQL Server 2005 (with compatibility for
SQL Server 2000). In SQL Server 2008, Reporting Services includes significant enhancements
that make it more versatile and much easier for you to create reports using SQL Server 2008
Analysis Services OLAP objects (that is, OLAP cubes or data mining models) as source data.
Reporting Services is designed to be a flexible enterprise-capable reporting solution for all
types of data sources (that is, relational, multidimensional, text, and so on).

The architecture of SSRS is built around the three types of activities that accompany report-
ing. These activity groups are report creation, report hosting, and report viewing. Each of
these activity groups contains one or more components that can be used to support the
activity. Before we drill into the core components in more detail, here’s a list of them and a
description of their primary functions:

■■ A Microsoft Windows service called SQL Server Reporting Services This is the area
where the core report processing is done in SSRS. This component is required. Here the
core input is processed and results are sent to a hosting environment and rendered as
a report in one of the many available output formats (that is, HTML, Excel, CSV, XML,
Image, PDF, Word, or custom).
■■ A Web service called ReportServer This exposes the core functionality of SSRS via
Web service calls. Although it’s not strictly a requirement to use this service, we’ve
found that all of our clients have chosen to use it. A significant improvement in SSRS
2008 is that Internet Information Services (IIS) is no longer required as a host for this
service. It can be hosted using the http.sys listener. We’ll go into more detail on this lat-
ter point later in this chapter.
■■ A Web site called Reports This is the default end-user interface for viewing the
reports that SSRS produces. End users access this ASP.NET 3.5 Web application, also
known as Report Manager, by navigating to the default URL http://localhost/reports.
The same Web site includes an administrative interface through which authorized
administrators can perform configuration tasks. This com-
ponent is optional. Some of our clients have chosen to use it, while others have pre-
ferred to use alternate client hosting environments, such as Office SharePoint Server
2007, custom Web sites, and so on.
■■ Command-line utilities, such as rsconfig.exe and others SSRS ships with several
command-line utilities that facilitate scripting of common administrative tasks. Also,
some administrative tasks can be completed by connecting to an SSRS instance using
SSMS.
■■ Report development environments SSRS includes a couple of templates in BIDS that
enable quick report development. Microsoft also plans to release an upgraded version
of the stand-alone visual report creation tool named Report Builder. As of this writing,
the announced plan is to release the upgraded version of Report Builder in late 2008
after RTM. Also, if you’re using Visual Studio 2008, SSRS adds an embeddable component,
called the Report Viewer, to the Toolbox for Windows Forms and Web Forms development
projects (see the sketch that follows this list).
■■ Metadata repository SSRS uses 31 SQL Server tables to store metadata for SSRS itself
and for its configured reports. These tables can be either stored in a dedicated SQL
Server database or integrated with Office SharePoint Server metadata (which is also
stored in a SQL Server database).
■■ Integrated hosting and viewing (optional) Depending on what other Microsoft
products are installed, such as Office SharePoint Server 2007, you might have access
to prebuilt and configurable SSRS hosting applications. These are usually Web sites.
In the case of Office SharePoint Server 2007, a set of templates called Report Center
ships as part of the product. We’ll take a closer look at this and at the integration
of SSRS and Office SharePoint Server 2007 metadata in Chapter 25. Also new to SQL Server
2008 SSRS is the ability to render reports in Microsoft Office Word (new) or in Microsoft
Office Excel (the same as the previous version for workbooks, and enhanced for chart rendering).
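
Here is a minimal sketch of that last item: pointing a Windows Forms ReportViewer control (from the
Microsoft.Reporting.WinForms namespace) at a report that has already been deployed to a report
server. The server URL and report path are placeholders, and reportViewer1 is assumed to be a
ReportViewer control already placed on the form.

    using System;
    using Microsoft.Reporting.WinForms;

    // In the hosting form's Load event handler:
    private void ReportHostForm_Load(object sender, EventArgs e)
    {
        // Remote mode renders the report on the report server rather than locally.
        reportViewer1.ProcessingMode = ProcessingMode.Remote;
        reportViewer1.ServerReport.ReportServerUrl = new Uri("http://localhost/reportserver");
        reportViewer1.ServerReport.ReportPath = "/AdventureWorks Sample Reports/Customer List";
        reportViewer1.RefreshReport();
    }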

Figure 20-1 (from SQL Server Books Online) shows the core components of Reporting
Services as well as a reference to third-party tools. One particularly compelling aspect
of SSRS in general is that Microsoft has exposed a great deal of its functionality via Web
services, making SSRS quite extensible for all three phases of report work—that is, design,
administration, and hosting.

[Figure 20-1 depicts the report authoring and viewing tools (Web browser, Report Builder,
Report Designer, Model Designer, Reporting Services Configuration, and third-party tools) and
Report Manager sitting on top of the Report Server components: programmatic interfaces,
authentication extensions, the scheduling and delivery processor, the report processor, and
the delivery, rendering, and data processing extensions, all backed by the report server
database and the report data sources.]
Figure 20-1 SSRS architecture (from SQL Server Books Online)

In Chapter 4, “Physical Architecture in Business Intelligence Solutions,” we introduced SSRS
installation considerations. There we examined the security context for the SSRS service itself.
We also examined backup and restore strategies. In the next section, we’ll expand on our ini-
tial discussion about SSRS installation and setup.

Installing and Configuring Reporting Services


As mentioned previously in this chapter, a major change to SSRS in SQL Server 2008 is the
removal of the dependency on IIS. This change was driven by customer demand, and it
makes SSRS more attractive in environments where the installation of IIS would have posed
an unacceptable security risk (or administrative burden). In fact, there is a SQL Server Books
Online entry describing how you need to configure IIS to prevent conflicts if you choose to
install both SSRS and IIS side by side on the same server: “Deploying Reporting Services and
Internet Information Services Side-by-Side.”

The first consideration when installing SSRS is, of course, on which physical server (or serv-
ers) to install its components. Although it’s physically possible to install SSRS on a physical
machine where either SSAS or SQL Server Integration Services (SSIS) has been installed (or
both have been), we don’t find this to be a common case in production environments. More
commonly, SSRS is installed on at least one dedicated physical machine, sometimes more,
depending on scalability or availability requirements. In Chapter 22, “Advanced SQL Server
2008 Reporting Services,” we’ll revisit multiple-machine installs, but as we get started, we’ll
just consider a single, dedicated machine as an installation source.

Another important consideration when planning your SSRS installation is which edition of
SSRS you should use. There are significant feature differences between the Enterprise and
Standard editions of SSRS—most of which have to do with scalability. We suggest you review
the feature comparison chart and base your edition decision on business requirements—par-
ticularly the number of end users that you expect to access the SSRS instance. For a complete
list of feature differences by edition, go to http://download.microsoft.com/download/2/d/
f/2df66c0c-fff2-4f2e-b739-bf4581cee533/SQLServer%202008CompareEnterpriseStandard.pdf.

After determining what hardware to use for your SSRS installation, your next consideration
is component installation and service account configuration. To install SSRS, you use SQL
Server 2008’s installer, which presents you with a series of dialog boxes where you enter the
configuration information. This information is stored in one of two types of locations—either
in XML configuration files (named RSReportServer.config or ReportServerServices.exe.config)
or in SQL tables. There are two possible locations for these SQL Server metadata tables. They
can either reside in a dedicated metadata database on a selected SQL Server 2008 instance
(called native mode) or be part of a SharePoint metadata SQL Server database (called
SharePoint integrated mode). Native mode is the default installation type. By default, native
mode creates two metadata databases—named ReportServer and ReportServerTempDB—in
the installed SQL Server 2008 instance. You can choose the service account that the SSRS
Windows service runs under during the installation. This account should be selected based
on your project’s particular security requirements.

Tip There is a new, third type of installation called a files-only installation. This type generates
configuration files as a result of the actions you take when using the Setup Wizard. This type of
installation is particularly useful for moving SSRS work from a development environment to a
production environment.

Microsoft provides both command-line tools and a graphical user interface for managing
this metadata after installation. The GUI tool is called the Reporting Services Configuration
Manager and is shown in Figure 20-2. You can see from this tool that you have the following
configuration options: Service Account, Web Service URL, Database, Report Manager URL,
E-Mail Settings, Execution Account, Encryption Keys, and Scale-Out Deployment.

Figure 20-2 Reporting Services Configuration Manager

You might be wondering exactly how the stripped-down HTTP listener works and provides
the functionality that formerly required IIS. We’ll take a closer look at that next. After that,
we’ll discuss some of the other core components of SSRS in more detail as well.

First we’ll include another component diagram of SSRS from SQL Server Books Online in
Figure 20-3. As we take a closer look at the architecture of SSRS, we’ll drill into the functional-
ity provided by the core components shown in this diagram. Note that the diagram indicates
core and optional (external) components of SSRS. As was the case in SQL Server 2005 SSRS,
in 2008 both the Report Manager and the Web service components are built on the ASP.NET
page framework. This allows experienced .NET developers to easily extend most of the core
functionality programmatically if requirements necessitate such actions.

[Figure 20-3 depicts the service architecture: the HTTP listener, RPC, and WMI endpoints pass
through the authentication layer to reach the Report Manager front end, the Report Server Web
service, and the background processing engine (report and model processing, scheduling,
subscription and delivery, and database maintenance), all hosted on a common ASP.NET-based
service platform that provides application domain and memory management; a key distinguishes
external, internal, and feature components.]
Figure 20-3 SSRS component architecture (from SQL Server Books Online)

New in SQL Server 2008 is the ability to interact with SSRS using Windows Management
Instrumentation (WMI) queries. This is a welcome addition that makes administrative control
more flexible.

HTTP Listener
New to SSRS 2008 is the use of the HTTP listener (also called by its file name, which is
http.sys). The HTTP listener monitors incoming requests on a specific port on the system
using http.sys. The host name and port are specified on a URL reservation when you config-
ure the server. Depending on the operating system you’re using, the port you specify can be
shared with other applications. As we mentioned, this approach effectively removes the need
for an IIS instance. This is an important improvement to SSRS and one that allows us to select
SSRS as a business intelligence (BI) client for a greater number of customers who had previ-
ously objected to the IIS dependency.

The HTTP listener implements the HTTP 1.1 protocol. It uses the hosting capabilities that are
built into the operating system—for example, http.sys itself. For this reason, SSRS requires
operating systems that include http.sys as an internal component, such as Windows XP
Professional, Windows Vista Business, Windows Server 2003, or Windows Server 2008.

When the HTTP listener processes a request, it forwards it to the authentication layer to
verify the user identity. The Report Server Web service is called after the request is authen-
ticated. The most common way to configure the http.sys interface is by using the Reporting
Services Configuration Manager shown earlier in Figure 20-2.

Report Manager
Report Manager is an administrative client application (Web site) that provides access to the
Report Server Web service via pages in the included reporting Web site. It’s the standard
tool for viewing and managing Report Server content and operations when SSRS is config-
ured in native mode. Report Manager can be used either locally or remotely to manage an
instance of Reporting Services, and it runs in a browser on the client
computer. Session state is preserved as long as the browser window is open. User-specific
settings are saved to the Report Server database and reused whenever the user connects to
Report Manager.

In addition to using Report Manager, you can also use the command-line tool rs.exe with
scripts to automate administrative processes associated with reporting. Some of these
include scheduled execution of reports, caching options, and more. For more information
about using rs.exe and to see some sample scripts, see the SQL Server Books Online topics
“rs Utility” and “Script Samples (Reporting Services).”
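
For example, a hypothetical deployment script named PublishReports.rss might be run like this,
where -i names the input script file, -s specifies the target report server URL, and -v passes a
variable that the script can reference:

    rs.exe -i PublishReports.rss -s http://localhost/reportserver -v targetFolder="Sales Reports"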

If you configure Report Server to run in SharePoint integrated mode, Report Manager is
turned off and will not function. The functionality normally provided by Report Manager is
included in a SharePoint report library. SSRS 2008 no longer allows you to manage SSRS con-
tent from SSMS, so you must manage it through Report Manager or Office SharePoint Server
2007 if you’re running in SharePoint integrated mode. In addition, new pages have been
added to Report Manager for generating models, setting model item security, and associat-
ing click-through reports to entities in a model.

Report Server Web Service


The Report Server Web service is the core engine for all on-demand report and model
processing requests that are initiated by a user or application in real time, including most
requests that are directed to and from Report Manager. It includes more than 70 public
methods for you to access SSRS functionality programmatically. The Report Manager Web
site accesses these Web services to provide report rendering and other functionality. Also,
other integrated applications, such as the Report Center in Office SharePoint Server 2007, call
SSRS Web services to serve up deployed reports to authorized end users.

The Report Server Web service performs end-to-end processing for reports that run on
demand. To support interactive processing, the Web service authenticates the user and
checks the authorization rules prior to handling a request. The Web service supports the
default Windows security extension and custom authentication extensions. The Web service
is also the primary programmatic interface for custom applications that integrate with Report
Server, although its use is not required. If you plan to develop a custom interface for your
reports, rather than using the provided Web site or some other integrated application (such
as Office SharePoint Server 2007), you’ll want to explore the SQL Server Books Online topic
“Reporting Services Web Services Class Library.” There you can examine specific Web meth-
ods. In Chapter 22, we’ll provide some examples of working directly with this API. For most
of our BI solutions, we find that our clients prefer custom application development to the
canned Web site included with SSRS.
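
To give you a feel for the API, the short sketch below assumes you have added a Web reference to
the ReportService2005.asmx endpoint in Visual Studio, producing a proxy class named
ReportingService2005 here (the class name depends on the reference you generate), and simply
lists everything in the report server catalog. Treat it as an outline rather than a finished
application.

    using System;
    using System.Net;

    class CatalogLister
    {
        static void Main()
        {
            // Proxy class generated from http://<server>/ReportServer/ReportService2005.asmx.
            ReportingService2005 rs = new ReportingService2005();
            rs.Url = "http://localhost/ReportServer/ReportService2005.asmx";
            rs.Credentials = CredentialCache.DefaultCredentials;

            // ListChildren with recursive = true walks the catalog from the root folder.
            CatalogItem[] items = rs.ListChildren("/", true);
            foreach (CatalogItem item in items)
            {
                Console.WriteLine("{0}  [{1}]", item.Path, item.Type);
            }
        }
    }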

Authentication
All users or automated processes that request access to Report Server must be authenti-
cated before access is allowed. Reporting Services provides default authentication based on
Windows integrated security and assumes trusted relationships where client and network
resources are in the same domain or a trusted domain. You can change the authentication
settings to narrow the range of accepted requests to specific security packages for Windows
integrated security, use Basic authentication, or use a custom forms-based authentication
extension that you provide.

To change the authentication type to a method other than the default, you must deploy a
custom authentication extension. Previous versions of SSRS relied on IIS to perform all types
of authentication. Because SSRS 2008 no longer depends on IIS, there is a new authentica-
tion subsystem that supports this. The Windows authentication extension supports multiple
authentication types so that you can precisely control which HTTP requests a report server
will accept. If you’re not familiar with the various Windows authentication methods—NTLM,
Kerberos, and so on—see http://www.microsoft.com/windowsserver2003/technologies/
security/kerberos/default.mspx for Kerberos and http://msdn.microsoft.com/en-us/library/
aa378749.aspx for NTLM for more information. Included authentication types include the
following:

■■ RSWindowsNegotiate Directs the report server to handle authentication requests
that specify Negotiate. Negotiate attempts Kerberos authentication first, but it falls
back to NTLM if Active Directory cannot grant a ticket for the client request to the
report server. Negotiate falls back to NTLM only if the ticket is not available. If the first
attempt results in an error rather than a missing ticket, the report server does not make
a second attempt.
■■ RSWindowsKerberos Reads permissions on the security token of the user who
issued the request. If delegation is enabled in the domain, the token of the user who is
requesting a report can also be used on an additional connection to the external data
sources that provide data to reports.
■■ RSWindowsNTLM Authenticates a user through an exchange of private data
described as challenge-response. If the authentication succeeds, all requests that require
authentication will be allowed for the duration of the connection. NTLM is used instead
of Kerberos under the following conditions:
❏■ The request is sent to a local report server.
❏■ The request is sent to an IP address of the report server computer rather than a
host header or server name.
❏■ Firewall software blocks ports used for Kerberos authentication.
❏■ The operating system of a particular server does not have Kerberos enabled.
❏■ The domain includes older versions of Windows client and server operating sys-
tems that do not support the Kerberos authentication feature built into newer
versions of the operating system.
■■ RSWindowsBasic Passes credentials in the HTTP request in clear text. If you use
Basic authentication, use Secure Sockets Layer (SSL) to encrypt user account informa-
tion before it’s sent across the network. SSL provides an encrypted channel for send-
ing a connection request from the client to the report server over an HTTP TCP/IP
connection.

By default, only RSWindowsNegotiate and RSWindowsNTLM are enabled. Each of these authen-
tication types can be turned on or off as necessary. You make changes to the default
configuration to enable other types of Windows authentication by making changes to
the RSReportServer.config files. For specifics, see the SQL Server Books Online topic “How
to: Configure Windows Authentication in Reporting Services.” As mentioned, to use non-
Windows authentication, custom authentication providers must also be used. You can enable
more than one type of authentication if you want the report server to accept multiple
requests for authentication.
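
For reference, the default settings correspond to a fragment of RSReportServer.config along these
lines (surrounding elements are omitted); you enable or disable a type by adding or removing its
element. Treat this as an approximate sketch rather than the complete file.

    <Authentication>
      <AuthenticationTypes>
        <RSWindowsNegotiate />
        <RSWindowsNTLM />
      </AuthenticationTypes>
    </Authentication>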

We’re pleased to see the improved flexibility in configuring authentication mechanisms for
SQL Server 2008 SSRS. It has often been a business requirement to implement some type of
custom authentication in our production BI projects.

Note In SQL Server 2008, Reporting Services does not support anonymous or single sign-on
authentication unless you write and deploy a custom authentication provider.

Background Processing (Job Manager)


Report Server also contains a job manager that enables background processing. Background
processing refers to operations that run in the background and are initiated by Report Server.
Most background processing consists of scheduled report processing and subscription deliv-
ery, but it also includes Report Server database maintenance tasks.

Background processing for scheduling, subscription, and delivery is configurable and can be
turned off through the ScheduleEventsAndReportDeliveryEnabled property of the Surface
Area Configuration for the Reporting Services facet in Policy-Based Management. For more
information on doing this, see the topic “How to: Turn Reporting Services Features On or
Off” in SQL Server Books Online. If you turn those operations off, scheduled report or model
processing will not be available until they’re re-enabled. The Database Maintenance task
is the only task that cannot be turned off because it provides core database maintenance
functionality.

Background processing operations depend on a front-end application or the Web service for
definition. Specifically, schedules and subscriptions are created in the application pages of
Report Manager or on a SharePoint site if the report server is configured for SharePoint inte-
gration, and then they’re forwarded to the Web service, which creates and stores the defini-
tions in the report server database.

All of the aforementioned components work together and provide reporting functionality
to administrators, developers, and end users. The sum total makes SSRS a very viable enter-
prise-capable reporting platform. After you’ve installed, configured, and verified your SSRS
instance, you’ll want to move on to the work of developing reports. In the next section, we’ll
take a look at using BIDS (which is just one of several possible ways you can use to author
reports for SSRS) to develop, preview, and deploy reports to SSRS Report Server.

Creating Reports with BIDS


To get started developing our first report, we’ll use BIDS. As mentioned previously, we’ll start
by building reports from OLTP data sources so that we can first focus on learning how to
use the report designer in BIDS. In the next chapter, we’ll look at how BIDS is used to design
reports for cubes and mining models.

To get started, in the New Project dialog box, select the Business Intelligence Projects project
type and then choose Report Server Project in the Templates area, as shown in Figure 20-4.
This gives you a blank structure of two folders in Solution Explorer: one folder for shared data
sources and one folder for reports.

Figure 20-4 BIDS contains two SSRS development templates.

Developing a report for SSRS consists of two basic steps: defining the data source, and defin-
ing the report layout. Data sources can be either shared by all reports in a project or local to
a specific report. Shared data sources are preferred to private (or report-specific) data sources
for several reasons. The first reason is that it’s most typical to be developing reports on one
set of servers with the expectation of deploying those reports to a different set of production
servers. Rather than having to change connection string information in each report in a proj-
ect, using shared data sources allows developers (or administrators) to update configuration
information once for each group of reports in that project when those reports are deployed
to production servers.

To define a new shared data source in your project, simply right-click on the Shared Data
Sources folder in Solution Explorer, and then click Add New Data Source on the shortcut
menu. Clicking Add New Data Source opens the Shared Data Source Properties dialog box
shown in Figure 20-5.

Figure 20-5 The Shared Data Source Properties dialog box

Note that data sources can be of any type that is supported by Reporting Services. By
default, the available data source types are Microsoft SQL Server, OLE DB, Microsoft SQL
Server Analysis Services, Oracle, ODBC, XML, Report Server Model, SAP NetWeaver BI,
Hyperion Essbase, and TERADATA. For our example data source, click Microsoft SQL Server
and then click the Edit button to bring up the standard connection dialog box. There you
enter the server name, database name, and connection credentials. We’ll use the sample rela-
tional database AdventureWorksDW2008 for this first sample report.

After you complete the configuration, a new shared data source is added to the Report
Server project. To add a new report, right-click the Reports folder and select Add, New Item,
and then Report from the shortcut menu. This opens the report designer interface shown in
Figure 20-6. Note that the report designer surface contains two tabs—Design and Preview.
Also, near the bottom of the designer surface, there are two sections named Row Groups
and Column Groups. These have been added to the SSRS 2008 BIDS report designer to make
report design more intuitive.

Figure 20-6 The SSRS report designer

To build a report, you need data to display. To obtain the data, you need to provide a query
in a language that is understood by the data source—that is, Transact-SQL for OLTP, MDX for
OLAP, and so on. Query results are called datasets in SSRS.

For the next step, open the Report Data window, which should be on the left side of the BIDS
window. Select New and then Data Source from the toolbar at the top of the Report Data
window. This opens the Data Source Properties dialog box shown in Figure 20-7.

Figure 20-7 The Data Source Properties dialog box

In this dialog box, you first reference a connection string or data source. You can either cre-
ate a new (private) connection string using this dialog box or reference any shared data
source that you’ve defined for this project. In this case, you should select the shared data
source created earlier. Note also that you can optionally select the Use Single Transaction
When Processing The Queries check box. Selecting this option causes all queries associated
with the dataset to execute as a single transaction.

After you’ve configured your data source, you need to configure the login credentials for this
particular data source. Click the Credentials link in the left pane to view the properties for the
data source credentials as shown in Figure 20-8. When using a shared data source, the con-
trols here will be disabled, as security information is defined in the shared data source.

It’s important to understand how these credentials will be used. The default is to use the
credentials of the user requesting the report via Windows Integrated Authentication. Other
choices are to specify a user name and password to be used every time the report is pro-
cessed, prompt the user to enter credentials, or use no credentials. When you configure this
dialog box, you’re setting the design-time credentials. Be aware that authorized administrators
can change these settings at run time, if your business requirements call for it, by using the
SSRS administrative Web site to update the values associated with the connection string (data
source).

Figure 20-8 The Data Source Properties dialog box for specifying credentials

Click OK to finish the data source configuration, and then right-click on the data source in
the Report Data window and choose New Dataset to create a new dataset. Click the Query
Designer button in the resulting dialog box. SSRS includes multiple types of query designers,
and we’ll detail those shortly.

Because the shared data source you created earlier is based on a relational data source,
in this case the generic query designer (shown in Figure 20-9) opens. The type of query
designer that opens depends on the type of data source—that is, a SQL Server source opens the
Transact-SQL designer, an SSAS source opens the MDX designer, and so on.

Reporting Services provides various query design tools that can be used to create queries in
the report designer. The kind of data that you’re working with determines the availability of
a particular query designer. In addition, some query designers provide alternate modes so
that you can choose whether to work in visual mode or directly in the query language. Visual
mode allows you to create queries using drag and drop or guided designers, rather than by
just typing in the query code ad hoc. There are five different types of query designers that
can be used, depending on the type of data that you’re working with:

■■ Generic query designer The default query building tool for most supported rela-
tional data sources.
■■ Graphical query designer Used in several Microsoft products and in other SQL Server
components. It provides a visual design environment for selecting tables and columns.
It builds joins and the (relational) SQL statements for you automatically when you select
which columns to use.
■■ Report model query designer Used to create or modify queries that run against
a report model that has been published to a report server. Reports that run against
models support click-through data exploration by authorized end users. The idea is to
provide end users with a subset of source data against which they can click and drag to
create reports based on further filtered subsets of the original data subset. The query
that the end user creates by clicking and dragging objects (called entities, attributes,
and so on) determines the path of data exploration at run time.
■■ MDX query designer Used to create queries that run against an Analysis Services or
other multidimensional data source. This query designer becomes available when you
create a dataset in the report designer that uses an Analysis Services, SAP NetWeaver
BI, or Hyperion data source.
■■ DMX query designer Used to retrieve data from a data mining model. To use this
query designer, you must have an Analysis Services data source that includes a data
mining model. After you select the model, you can create data mining prediction que-
ries that provide data to a report.

Figure 20-9 The generic query designer



An alternative to launching the query designer as just described is to open the dataset
designer from the Report Data window menu. After you do that, you’ll see the Dataset
Properties dialog box shown in Figure 20-10. Here you can define not only the source query,
but you can also configure query parameters, report fields, report options, and report filters.
We’ll look at the last few items in a bit more detail later in this chapter.

Figure 20-10 The Dataset Properties dialog box

Tip If you’re building reports using OLTP data sources, such as SQL Server, it’s best practice to
first define the source query in the RDBMS as a stored procedure. Your reports will perform bet-
ter if you use stored procedures rather than ad hoc Transact-SQL queries because of the built-in
optimization characteristics associated with stored procedures. Also, limiting RDBMS access to
stored procedures, rather than open SELECT statements, provides much tighter security and is
usually preferred to granting full SELECT permission on RDBMS objects (tables, views, and so on).

To continue the example, you next enter a simple query, such as SELECT * FROM
DimCustomer. Click OK to leave the query designer, and click OK again to close the Dataset
Properties dialog box. After the query is built and executed, you need to place items from
the query onto the report. To do this, you need to view the Report Data window, which can
be accessed either by selecting Report Data from the View menu or pressing Ctrl+Alt+D.

You first need to select a type of report data container on the report designer surface. There
are a couple of different types of containers into which you can place your data. The default
is a table. In SSRS 2008, there are a couple of new container types—notably the Tablix and Gauge
containers. We’ll take a closer look at both of those later in this chapter and in the next one.

First we’ll build a basic tabular report. To create a simple tabular report from the dataset cre-
ated earlier, right-click on the designer surface, select Insert, and then select Table. Doing this
opens a blank tabular report layout. Using our dataset defined in the previous paragraphs,
you can drag the Customer Key, Birth Date, Marital Status, and Gender fields from the Report
Data window to the table layout area.

Next you can apply formatting to the header, detail, and table name sections using the for-
matting toolbar or by selecting the cells of interest, right-clicking them, and then clicking
Format. An example of a simple table is shown in Figure 20-11.

Figure 20-11 The tabular report designer

Also, you’ll see the results of your query displayed in the Report Data window on the left side
of BIDS. It contains an entry for each dataset defined for this report. Each dataset is shown in
a hierarchical tree view, with each displayable column name listed in the tree.

From this designer, you can drag columns from the report dataset and place them in the
desired location on the designer surface. You can also add more columns to create the basic
report layout as shown in Figure 20-11.

Each item on the report surface—that is, row, column, and individual text box—can be con-
figured in a number of ways. The first way is by dragging and dropping one or more dataset
values directly onto the designer surface. SSRS is a smart designer in that it automatically
creates the type of display based on the location you drag the item to on the report surface.
For example, on a tabular report, if you drop a dataset field onto a column in the table, it will
show individual values in the detail rows and add the column name to the header area. This
smart design also applies to automatic creation of totals and subtotals.

You can also manipulate items by configuring properties in the Properties window, which by
default appears at the bottom right in BIDS. In addition, you can use the toolbars, which include a Word-
like formatting toolbar. And finally, you can also right-click on the designer surface to open
a shortcut menu, which also includes an option to configure properties. The report designer
surface is straightforward to use and is flexible in terms of formatting options. New to SSRS
2008 are sections at the bottom of the report designer surface named Row Groups and
Column Groups. You use these areas to quickly see (and change if desired) the grouping lev-
els defined in your report.

After the report is created to your satisfaction, you can click the Preview tab on the report
designer surface to see a rendered version of the report, as shown in Figure 20-12. Of course,
because we’ve applied very little formatting to this report, its appearance is rather plain.
You’ll certainly want to use the built-in capabilities to decorate your production reports with
appropriate fonts, formatting, text annotation, and such to make them more appealing to
their end-user audiences.

Figure 20-12 The report in preview mode

After you click the Preview tab, you might notice that the Output window opens up in BIDS
as well as the rendered report. If there were no design errors in your report, it will render and
be displayed on the Preview tab. If there were errors, they’ll be listed in the Errors window in
BIDS. And, as with other types of development, if you click on any particular error in the error
list, BIDS takes you to the error location so that you can make whatever correction is needed.

To fix errors, you need to understand what exactly is happening when a report is built
in BIDS. Unlike traditional development, building a report does not compile the Report
Definition Language (RDL). Rather, the RDL is validated against the internal XML schema. If
there are no invalid entries, the report is built successfully, as seen in Figure 20-12. As with
traditional coding, fatal errors are shown using red squiggly lines and warning errors are
shown using blue squiggly lines.

If you’ve made some type of error when you created your report, you see a brief error
description on the Preview tab (rather than the rendered report) and you can open the Errors
window in BIDS (from the View menu) to see more detail about the particular error or errors.

The Errors window lists all errors, ranked first by type—for example, fatal (or red)—and then
includes a description of each error, the file where the error is located, and sometimes the
line and column. Although you could open the RDL associated with the report by right-
clicking the file name in Solution Explorer and then clicking View Code, you’ll more commonly
read the error description and then navigate to the GUI location in BIDS where you can fix
the error. After you resolve any error or errors, you simply save your changes and then click
on the Preview tab again and the report will render.

Note When using BIDS, in addition to being able to create new reports you can also import
report definition files that you’ve created using other tools. To import existing files, you right-
click on the Reports folder in Solution Explorer, click Add, and then click Existing Files. You can
import files of the *.rdl or *.rdlc format. To import Microsoft Office Access reports, you must
have Access 2002 or later installed. Access reports can be imported by selecting Project, Import
Reports, and then Microsoft Access. It’s possible that some aspects of Access reports will not
import correctly and might require manual correction.

Other Types of Reports


As mentioned, you’re not constrained to using only a tabular format for reports. The Toolbox
contains several types of containers and is shown in Figure 20-13. Note that you can select
from Table, Matrix, List, Subreport, Chart, or Gauge.

Figure 20-13 The Toolbox in BIDS



The Gauge type is new to SQL Server 2008 SSRS. We most often use Table, Matrix, or Chart.
List is used when you want to apply some type of custom formatting, such as a multicolumn
display. List is a blank container, and lets you apply your own layout. It’s very flexible and a
good choice if your business requirements call for some type of custom formatted output.
You should also note that you can nest other container types—such as tables and charts—
inside of a list.

You might also have heard something about a new type of container, called a Tablix. We’ll
cover the Tablix container in more detail in Chapter 21, “Building Reports for SQL Server 2008
Reporting Services.” At this point, we’ll mention that when you select a table, matrix, or list,
you’re actually getting a Tablix data container.

For our next example, we’ll build a report using the new Gauge control. It’s interesting to see
more visually rich controls, such as the gauge, being added to the SSRS development envi-
ronment. Given the challenge of building reports that appropriately present BI data to vari-
ous end-user communities, we think this a positive direction for SSRS in general.

As with creating any report, to use the Gauge control, you must, of course, first define a data
source and a dataset. Do this following the steps described earlier in this chapter. For this
sample, we’ll use a simple relational dataset from the AdventureWorksDW2008 sample database.
We’ve just used the query SELECT * FROM DimCustomer to get a dataset for display.

In the next step, we dragged the customer last name field onto the gauge display surface.
SSRS automatically applies the COUNT aggregate to this field. Figure 20-14 shows the output.
In addition to displaying an aggregate value on the gauge itself, we’ve also chosen to show
an SSRS system variable and the date and time of report execution, and we chose to include
a report parameter, Co Name, in our particular report.

In addition to the standard output, you can configure the many properties of this rich control
by simply clicking on the item you want to configure—for example, pointer, scale, and so
on—and then accessing its properties in the designer by using the Properties windows or by
using the shortcut menu to open a control-specific properties dialog box.

Sample Reports
You might want to take a look at some sample reports available on CodePlex (see
http://www.codeplex.com/MSFTRSProdSamples/Release/ProjectReleases.aspx?ReleaseId=16045)
so that you can gain a better understanding of the various design possibilities. We think it
will be valuable for you to see the various formatting and layout options as you consider to
what extent you want to use SSRS as a client in your BI project. As with the other types of
samples mentioned in this book—such as SSAS, SSIS, and so on—to work with the sample
SSRS reports, you must download the samples and then install them according to the instruc-
tions on the CodePlex site.

Figure 20-14 The Gauge control rendered

Deploying Reports
After the report is designed to your satisfaction, the report must be deployed to an SSRS
server instance to be made available to applications and users. As we learned in the error-
handling section earlier, a report consists of a single RDL file. If you choose to use shared data
sources, report deployment also includes those RDS files. Deployment in SSRS simply means
copying the RDL and RDS files from BIDS to your defined deployment location.

To deploy a report project, you must first configure the development environment. To con-
figure the SSRS development environment for report deployment, right-click on the report
server project and then click Properties. This opens the Report Project Property Pages dialog
box as shown in Figure 20-15.

Figure 20-15 Report deployment properties



This dialog box is where you specify the location on the SSRS server for any reports that are
contained within the project. There are four properties (an example configuration follows the list):

■■ Overwrite Data Sources Allows you to choose to always overwrite all existing data
sources on the server or ignore them if they already exist. This is an important option
to configure according to your requirements, especially if your data sources change
during report development.
■■ TargetDataSourceFolder Allows you to choose the destination deployment folder on
the SSRS server for data connection files.
■■ TargetReportFolder Allows you to choose the destination deployment folder on the
SSRS server for report files.
■■ TargetServerURL Allows you to choose the destination deployment server.
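
For example, a typical test deployment might use values along these lines (the server name here
is a placeholder); note that TargetServerURL points to the ReportServer virtual directory rather
than the Report Manager URL:

    OverwriteDataSources    False
    TargetDataSourceFolder  Data Sources
    TargetReportFolder      AdventureWorks Reports
    TargetServerURL         http://reportserver01/reportserver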

Note If you’re upgrading from SQL Server 2005 SSRS, you should be aware that Microsoft has
made substantial changes to the RDL that SSRS uses. To that end, reports created using BIDS or
Report Builder from SQL Server 2005 must be upgraded to be able to run in SQL Server 2008
SSRS. If these reports were in an upgraded Report Server database or are uploaded to an SSRS
2008 instance, they will be automatically upgraded the first time they’re processed.
Reports that can’t be converted automatically are processed using backward-compatible com-
ponents. You can also convert older reports using BIDS 2008. If you choose not to convert older
reports, you can run those reports by installing SQL Server 2005 SSRS side by side with SQL
Server 2008.

After you have configured this information, in the BIDS Solution Explorer window, right-click
on the project name and then click Deploy. As deployment proceeds, BIDS first attempts to
build (that is, validate) all RDL and RDS files. After successful validation, BIDS copies those files
to the configured deployment location.

If you’re using the default SSRS Web site to host your reports, you can then click on the URL
for that Web site to see the reports in a browser. The default URL is http://<%servername%>/
Reports/Pages/Folder.aspx. Of course, you might be using other hosting environments, such
as Office SharePoint Server 2007, a custom Windows Forms application, and so on. We’ll take
a closer look at such alternatives in Chapters 22 and 23.

Summary
In this chapter, we discussed the architecture of SSRS. To that end, we took a closer look at
the included components that support authoring, hosting, and viewing report functions.
These components include the Windows service; Web service; Web site, configuration, and
development tools; and metadata storage. We discussed best practices for installation and
configuration. Next we walked through the process of developing a couple of reports using
various container types in BIDS. Finally, we configured properties for completed report
deployment.

In the next chapter, we’ll look more specifically at how best to use SSRS as a client for SSAS
objects. There we’ll look at the included visual MDX and DMX query designers. We’ll also take
a look at the use of container objects that lend themselves to SSAS object display—such as
the new Tablix container.
Chapter 21
Building Reports for SQL Server
2008 Reporting Services
In this chapter, we take a look at the mechanics of creating reports based on SQL Server 2008
Analysis Services (SSAS) objects: OLAP cubes and data mining models. To that end, we exam-
ine using the included MDX Query Designer. Then we take a look at parameter configuration.
We conclude this chapter by looking at the redesigned Report Builder report creation tool
that Microsoft released in October 2008.

We start by introducing best practices related to using the MDX Query Designer that is
included with the SQL Server 2008 Reporting Services (SSRS) developer interface in Business
Intelligence Development Studio (BIDS). This is a good starting point because, of course,
implementing an effective query is the key to producing a report that performs well. Later in
this chapter, we also look at the included DMX Query Designer.

Using the Query Designers for Analysis Services


After you open BIDS to create a new project of type Report Server Project and create a data
source, you’ll see the Reporting Services designer and Toolbox shown in Figure 21-1. As
you begin to create your first report, you must decide whether you prefer to configure one
connection specific to each report—that is, an embedded connection—or to define project-
specific connections. We favor the latter approach for simpler management.

To define project-specific connections, you create a shared data source by right-clicking
on the Shared Data Sources folder in Solution Explorer. Then select Microsoft SQL Server
Analysis Services from the drop-down list of data sources and provide a connection string.
For the examples in this chapter, we’ll continue using the Adventure Works DW 2008 OLAP
database. Configure the connection credentials—that is, Windows, custom authentication,
and so on—as you’ve done in previous connection configurations in BIDS.

In this section, we cover two of the query designers available in Analysis Services: the MDX
Query Designer, with its visual and drag-and-drop modes, and the DMX Query Designer,
which you use when you want to base your report on data mining results. Along the way, we
also provide information on how to set parameters in a query designer.


Figure 21-1 Reporting Services designer surface in BIDS

MDX Query Designer


When you make (or reference an existing) connection to an SSAS data source by opening a
new report and adding a dataset, the query designer opens by default in visual MDX query
generation mode. You can switch to manual MDX or to DMX mode by clicking buttons on
the query toolbar. We’ll cover those scenarios subsequently in this section.

After you’ve successfully created a connection, you define a query for the report that you’re
working on by creating a new report, and opening the Report Data window. From there,
choose New, Data Source and select a shared data source based on SSAS. Finally, choose
New, Dataset, select the data source created in the previous step, and click the Query
Designer button. This opens the MDX Query Designer.

By default, the first cube in alphabetical order by name appears in the list of cubes in the
upper right corner of the query designer. If you want to change that value, click on the Build
(…) button to the right of the selected cube to open a dialog box that allows you to select any
of the cubes contained in the OLAP database. After you’ve verified that you’re working with
the desired cube (Adventure Works for this example), you need to create the query.

Tip Just as with other built-in metadata browsers, the browser included in the SSRS Query
Designer includes a Measure Group filter. Because we like to design MDX queries using drag and
drop to save time, we also frequently use the Measure Group filter to limit the viewable measures
and dimensions to a subset of cube attributes.

As mentioned, there are two modes for working with the query designer when query-
ing OLAP cubes. The default is a visual (or drag-and-drop) mode, as shown in Figure 21-2.
SQL Server Books Online calls this design mode. We recommend using this mode because
it greatly reduces the amount of manual query writing you need to do to build a reporting
solution for SSAS and results in the ability to generate reports much more quickly. The other
mode is manual MDX entry, which we take a closer look at later in this section. SQL Server
Books Online calls this query mode. New in SQL Server 2008 is the ability to import an exist-
ing MDX query in the query designer using the Import button on the toolbar.

Figure 21-2 SSAS MDX Query Designer in BIDS

Figure 21-3 shows a drag-and-drop query. To create this query, you must first filter the meta-
data view to show only measures and dimensions associated with the Internet Sales group.
Then drag the Internet Sales Amount measure to the designer surface, expand the Date
dimension, and drag the Date.Calendar Year level from the Calendar folder onto the designer
surface. Next drag the Sales Reason level from the Sales Reason dimension to the designer
surface. Finally, configure the slicer value (at the top of the designer) to be set to the Product
dimension, Category hierarchy, and set the filter for the values Bikes and Accessories.

Figure 21-3 MDX SSAS Query Designer in BIDS showing query results

If you want to view or edit the native MDX query that you visually created, click the last but-
ton on the right on the toolbar (the Design Mode button). This switches the visual query
designer to a native query designer (called query mode). You can then view the native MDX
code and, optionally, also re-execute the query to view the results. For our example, the MDX
query looks like that in Figure 21-4.

Figure 21-4 Manual MDX SSAS query in BIDS

By examining the generated query, you can see that the query designer automatically added
the MDX NON EMPTY keyword to your query. Also, you can see that the dimension levels
were included by name in the query (Sales Reasons), whereas specific dimension mem-
bers were referenced by their key (Product Category). You might recall from Chapter 10,
“Introduction to MDX,” and Chapter 11, “Advanced MDX,” that these naming properties vary
depending on the type of object you’re using in the query.

If you’re thinking, “Wow, that’s a long MDX query statement!” you’re not alone. We’ve said
it before and we’ll say it again here: For improved productivity, make maximum use of the
drag-and-drop MDX query designers in all locations in BIDS—in this case, in the SSRS Query
Designer. In addition to viewing the MDX query statement in query mode, you can also
edit it.

One word of caution: If you switch from design mode to query mode and then make manual
changes to the query, you won’t be able to switch to design mode without losing any manual
changes that you made while in query mode. Before you switch back, BIDS will generate a
warning dialog box, as shown in Figure 21-5.

Figure 21-5 Query designer mode switch warning dialog box in BIDS

When you’re working in design mode, you can create local calculated MDX members. To
do this, right-click on the lower left of the designer surface (the Calculated Members area)
and select New Calculated Member from the shortcut menu. You are presented with a dia-
log box where you can write the MDX query for the calculated member. As with other MDX
query writing, we recommend that you drag and drop metadata and MDX functions into
the Expression area of the dialog box rather than manually typing the query. The interface
for creating these calculated members is very similar to the one that you used when creating
global calculated members using BIDS for SSAS.

We’ll elaborate a bit on the concept of local versus global objects. Local means the objects
are visible only to one particular report. This differs from creating calculated members
as objects of a particular cube using the cube designer’s Calculations tab in BIDS. Calculated
members created inside the cube can be considered global, whereas calculated members created
with the query designer in BIDS for SSRS are local to the specific report in which they’re
defined. We prefer to create global calculated members (in the OLAP cube definition, using BIDS
for SSAS) rather than local (report-specific) members because the former are more visible, more
reusable, and easier to maintain. The only time we use local (report-specific) calculated
members is when we have a very specific requirement for a small subset of users.

Setting Parameters in Your Query


We briefly interrupt the discussion of query designers to stress the importance of setting the
right parameters in your query. You can enable one or more parameters in your query by
selecting the Parameter option in the filter section at the top right of the query design work
area. These parameters can be presented in the user interface as a blank text box or drop-
down list (showing a list you provide or one that is generated based on another query), and
they can show a default value. You can also allow for the entry or selection of multiple values.

Figure 21-6 shows the same query as was generated in Figure 21-4, but with the Parameter
option selected for the Product filter, now rendered in the query builder so that the MDX
statement is visible (query mode). By examining the generated MDX after you select the
Parameter option in the filter section, you can see that the MDX now includes a parameter
designated by the @ProductCategory value. In addition, an IIf function was added to return
either the currently selected parameter value or, by default, the currently displayed member
value.

Figure 21-6 Manual MDX SSAS query in BIDS, which includes parameters

After you enable parameters, the query designer adds several important lines of MDX to your
query without you having to write them. Notice that when you view the generated query text
(query mode), you can see two of the other buttons on the toolbar: the Query Parameters
and Prepare Query buttons. These are the fifth and fourth buttons, respectively, from the
right side of the toolbar. They are available only when you’re working in query mode.

When you click the Query Parameters button, a dialog box appears that allows you to visu-
ally configure the query parameters that become available. Here you can specify parameter
names; associate those names to dimensions, attributes, or hierarchies (using the Hierarchy
section); allow multiple values; and set a default value. This dialog box is shown in Figure 21-7.

Figure 21-7 Query Parameters dialog box in the query designer for SSRS

The Prepare Query button acts much like the blue check button in SQL Server Management
Studio (SSMS) Transact-SQL query mode—that is, when you click it, the query syntax is
checked and any errors are returned to you via a pop-up message. The one item lacking in
this view mirrors a missing feature of the Transact-SQL query interface: IntelliSense.
Unfortunately, none is available. If you need to author most of your
MDX queries manually, we recommend that you obtain a query-writing interface that gives
you more feedback to write your query properly, and then copy and paste the completed
query into the SSRS Query Builder dialog box.

Tip Although MDX IntelliSense is not included with BIDS, you can download MDX Studio, a tool
from Mosha Pasumansky that helps with MDX query writing and includes MDX IntelliSense, at this URL:
http://sqlblog.com/blogs/mosha/archive/2008/08/22/intellisense-in-mdx-studio.aspx.

We remind you that although you should use the visual tools in the SSRS MDX Query Designer
as much as possible to improve your query-writing productivity, there is no substitute for
solid cube design and effective MDX query writing. No tool can overcome inefficient data
structure design and inefficient query writing. Also, aggregation and storage design will fac-
tor into your query execution efficiency. Before we go on to look at report layout for OLAP
cubes, we’ll take a quick look at how the query designer works for DMX data models.

DMX Query Designer


As mentioned, when you create a new dataset against an SSAS data source, the default query
designer that opens in BIDS for SSRS is the MDX designer. You can switch this to show the
DMX designer if you want to use data mining models as a basis for your report by clicking
the second toolbar button (the pickaxe icon) from the left. Doing this displays a query inter-
face that is identical to the one we reviewed when querying data mining models in BIDS.
More specifically, this interface looks and functions exactly like the Mining Model Prediction
tab for a particular mining model in BIDS for an SSAS database. Of course, in SSRS, you must
first select the particular mining model you want to use as a basis for your report query.

For our example, select the Targeted Mailing data mining structure from the Adventure
Works DW 2008 sample database. From the structure, select the TM Decision Trees data min-
ing model. Then in the Select Input Table pane in the designer, click the Select Case Table
button and select the vTargetMail view from the list. As in BIDS, the interface in SSRS auto-
matically creates joins between source and destination columns with the same names. Again,
as in BIDS, if you need to modify any of the automatically detected join values, right-click on
the designer surface and then click Modify Mappings to open a dialog box that allows you to
view, verify, and update any connections. Also, identical to BIDS, you can right-click anywhere
on the designer surface and select Singleton Query to change the input interface (which
defaults to a table of values) to a singleton query input.

After you’ve configured both the source data mining model and the input table or singleton
values, you use the guided query area on the bottom of the designer to complete the query.
In our case, we are showing an example of a singleton query using the PredictProbability
function, taking the [Bike Buyer] source column as an argument to this query function. Figure
21-8 shows this sample in the interface.

Figure 21-8 DMX Query Designer for SSRS in BIDS in design mode

As with the MDX query designer, if you want to view or edit the generated DMX, you can
do so by clicking the last button on the right side of the toolbar (Design Mode). Clicking this
button renders the query in native DMX. Finally, as with the MDX designer, you can use the
Query Parameters button on the toolbar to quickly and easily add parameters to your DMX
source query. Figure 21-9 shows what the singleton DMX query looks like in query mode in
the query designer. You might get a value of 38 for age, rather than the 37 shown, when you
run this query against the sample.

Using the Query Parameters button, we added a parameter to our query. We named it
YearlyIncome and set the default value to 150000. Unlike adding a parameter to an MDX
query, when you add a parameter to a DMX query, the generated DMX does not include that
parameter. The parameter is visible in the Report Data window. If you want to view or update
its properties, you can do so by right-clicking on the parameter in the object tree and then
clicking Properties. This lack of visibility in DMX is because the parameter is a report param-
eter rather than a DMX parameter. We highlight this behavior because it differs from that of
parameters in OLAP cubes (that is, MDX queries) and might be unexpected.

Figure 21-9 DMX Query Designer for SSRS in BIDS in query mode

After you’ve written either your MDX or DMX query, you’re then ready to lay out the query
results on a report designer surface. You saw a basic table layout in Chapter 20, “Creating
Reports in SQL Server 2008 Reporting Services.” Now we’ll take a closer look at laying out
reports that use OLAP cubes as a data source, which is the most common approach in busi-
ness intelligence (BI) projects. In the next section, we’ll do just that, using both of the sample
MDX and DMX queries that we just built as a basis for laying out report objects. We’ll start
with the sample MDX query in the next section.

Working with the Report Designer in BIDS


As we explained in Chapter 20, when you work with the report designer in BIDS, you use the
Design and Preview tabs in the main work area to lay out and preview your reports. Recall
also from Chapter 20 that you’re actually creating Report Definition Language (RDL) meta-
data when you’re dragging and dropping items onto the designer area and configuring their
properties. Because RDL is an XML dialect, keep in mind that all information is case sensitive.
We find that case errors are a common source of subtle bugs in reports, so we start our dis-
cussion of report design with this reminder.

In addition to working with the designer surface, you’ll also use the Toolbox to quickly add
data and other types of visual decoration, such as images, to your reports. In SQL Server
2008, there are two new or improved types of report items. In Chapter 20, we took a brief
look at the new Gauge control. Shortly, we'll explore the expanded capabilities of both
the table and matrix data regions.

Before we do that, however, let’s examine another window you’ll want to have open when
designing reports—the Report Data window. To open it, go to the View menu and click
the Report Data option. Figure 21-10 shows the Report Data window populated with the
datasets we created using the MDX query shown in the previous section. The Report Data
window in SQL Server 2008 replaces the Data tab in the SSRS BIDS designer used in SQL
Server 2005.

Figure 21-10 Report Data window in BIDS

By taking a closer look at the Report Data window, you can see that in addition to the fields
defined in your MDX query—that is, Calendar_Year, Sales_Reason, and so on—you also have
access to the parameters defined in your query, both as report parameters and as dataset
values. The configuration options differ for parameters and datasets. In addition to these fields,
SSRS includes a number of built-in fields that you can use in your report definitions. These
include the fields shown in Figure 21-10, such as Execution Time, Page Number, and so on.

Note If you do not see the second dataset for the ProductCategory parameter, you might have
to right-click on DataSource1 and select the Show Hidden Datasets option or right-click in an
empty spot in the Report Data window and select Show All Hidden Datasets.

If you want to further configure any of the items displayed in the Report Data window, click
the item of interest and then click Edit at the top of the window to open its Property configu-
ration window. If the Edit button is disabled, there are no editable properties for the selected
item. Configuration options vary depending on the type of object selected—that is, field,
table, parameter, and so on. We’ll take a closer look first at the DataSet Properties dialog box.

Select the ProductCategory dataset, and click Edit to open the dialog box. You can view or
change the source query, add or remove parameters, add or remove fields, set advanced
options (such as collation), and define dataset filters. This last item is particularly interesting
for BI-based reports because source queries can return large, or even huge, datasets. It might
be more efficient to filter (and possibly cache) the source query information on a middle-tier
SSRS server than to continually query an SSAS source server on each report render request.
We’ll go into more detail about scalability considerations in Chapter 22, “Advanced SQL
Server 2008 Reporting Services.” For now, we’ll just examine the mechanics of creating a filter
on a dataset.

To create a filter on a dataset, select the Filters option in the DataSet Properties dialog box
and then click Add. After you select the expression ([ParameterValue], for this example) and
operator (=), you define the value. Here you can supply a static value or use an expression to
define the value. To define an expression, click the fx button, which opens the Expression dia-
log box shown in Figure 21-11. Note that IntelliSense is available in this dialog box, which is a
welcome addition to the SSRS interface.

Figure 21-11 Expression dialog box in SSRS in BIDS

The syntax in Figure 21-11 equates to setting the value to the first value in a field (Fields!)
collection, where the field collection is based on the ParameterValue field from the
ProductCategory dataset. The Expression editor colors strings (which should be delimited
with double quotes) in brown and other syntax in black. It also flags syntax errors by adding
red (fatal) or blue (warning) squiggly lines underneath the offending part of the statement.

Note If you use expressions to define values anywhere in an SSRS report, the syntax must be
correct. If not, the report will not render and an error will be displayed in the Errors window
in BIDS.

Before we leave the Report Data window, let’s look at the configuration options for defined
parameters. Select the ProductCategory parameter in the Parameters folder, and click Edit
to see the Report Parameter Properties dialog box. This dialog box includes general proper-
ties, such as setting the prompt value and data type, as well as options for specifying the
source for the parameter values, such as the label value, and the value to use as a key. Using
this Properties dialog box, you can also specify the refresh rate for parameter values. The
default is set to Automatically Determine. Other options are Automatically Refresh and Never
Refresh. If your report includes long parameter lists, manually configuring the refresh rate
can affect report performance. After you’re satisfied with the data fields available for your
report, you’ll want to select data regions and other visual elements and begin to lay out your
report.

Understanding Report Items


The next step in report creation is to select the items and data you want to display on your
report. The quickest way to add layout items to your report is to drag them from the Toolbox
onto the designer surface. The Toolbox is shown in Figure 21-12. Some of the report items
can display data. These are known as data regions in SSRS. SSRS 2008 includes the following
data regions: the Table, Matrix, Textbox, List, Image, Subreport, Chart, and Gauge controls.
Of the data-bindable controls, some (such as Table, Matrix, and so on) contain intelligent
formatting and display capabilities; others (such as List) contain little or no formatting. List
allows you to define specialized layouts, such as repeating multicolumn tables.

Figure 21-12 Toolbox for SSRS in BIDS

Most frequently, we use the Table, Matrix, Chart, and Gauge data region types for BI reports.
Figure 21-13 shows the Chart and Gauge controls on the designer surface in their initial con-
figurations. Note that you can drag as many types of controls as you want onto the report
designer surface. Also, as mentioned previously, you can nest certain types of controls
inside other controls as well. The Gauge control is new to SQL Server 2008 and provides
enhanced, native visualization of data in SSRS. In many of our past projects, we chose
to purchase third-party report controls to enhance the look and feel of end-user reports, so
we see inclusion of richer visual controls as a very positive change for SSRS.

Figure 21-13 Common data regions for SSRS in BIDS

List and Rectangle Report Items


The list and rectangle report items have some similarities, the most notable of which is that
you can nest other data-bindable data regions inside of either one—that is, place a table
inside a list or have multiple matrices inside of a rectangle to more precisely control the data
display and layout. However, there are some differences between these two report item
types. A rectangle is a simple control with no particular format or dataset associated with it.
You use it when you want to control layout. A list functions similarly, but it’s a type of Tablix
container and is associated with a dataset. We explain exactly what that means in the next
section.

Tablix Data Region


New to SQL Server 2008 is the Tablix data region. Probably the first consideration you’ll
have is how to implement this type of data region in SSRS because it does not appear in
the Toolbox. As we mentioned in Chapter 20, when you drag data regions of type Table,
Matrix, or List onto the designer surface, you’re actually working with an instance of a Tablix
data region. Each of these variations on a Tablix data region is simply formatted in a differ-
ent starting default configuration. In other words, if you drag a Table data region onto the
designer surface, you get a Tablix data region that is presented in a (starting) table configura-
tion; if you drag a Matrix data region, you get a Tablix data region that is formatted to look
like a matrix; and so on.

MSDN explains Tablix data regions as follows:

Tablix enables developers to generate reports that combine fixed and dynamic
rows. Previously, layouts of this kind had to be developed by using multiple matrix
data regions and shrinking row headers. Support for Tablix data regions simplifies
the inclusion of combined static and dynamic data in reports, and extends the
formatting and layout capabilities of Reporting Services significantly.

You can read more about Tablix data regions at http://msdn.microsoft.com/en-us/library/bb934258(SQL.100).aspx.

What does this really mean for you? It means that you can easily change and add features
from one type of structure—that is, a table with rows and columns—into a matrix with roll-
up totals in both the columns and rows areas. This flexibility is terrific for reports based on
OLAP cubes because it’s quite common for tabular reports to evolve into more matrix-like
structures.

Now let's build a report using a Tablix data region. We start with a Table because that's the
control we most often use when displaying OLAP data. We'll continue to work
with the parameterized query that we showed earlier in this chapter.

After dragging the Table control onto the designer surface, populate the fields with some
data values (Calendar_Year, Sales_Reason, and Internet_Sales_Amount) by dragging them
from the Report Data window. In this sample, we’ve included some built-in fields, such as
UserID, ReportName, and ExecutionTime. These values are populated at run time, and we fre-
quently include them in production reports. This is shown in Figure 21-14.

Figure 21-14 Designing a basic OLAP report in BIDS

Of course, you'll probably spend much more time formatting your report output. You have
many different ways to add formatting to the items included in your report. We often use
the Properties dialog box associated with each item (that is, Tablix, Textbox, Image, and
so on) or the Formatting toolbar. We won't cover report object formatting in any more detail
here because it's pretty straightforward.

You’ll also want to understand how to use the Tablix features. Using our sample, you can
right-click on the [Internet_Sales_Amount] cell, then click Add Group, and then Parent Group
in the Column Group section of the menu. Select [Internet_Sales_Amount] as the Group By
field in the resulting dialog box, and select the Add Group Header check box. This adds a
column grouping level on the [Internet_Sales_Amount] field. If you were to select the same
options in the Row Group section of the menu, you would create a row grouping on the
[Internet_Sales_Amount] field. Figure 21-15 shows only the single field we’ve mentioned and
the redesigned shortcut menus containing the available options for adding grouping rows or
columns to your report.

Figure 21-15 Shortcut menus showing Tablix options in BIDS

After you select a text box containing an aggregate, the data region identifies the groups
in scope for that aggregate by adding an orange bar to the designer surface. Figure 21-16
shows the design output after you add the grouping level previously described.

Figure 21-16 Adding a grouping level

The flexibility available in Tablix data regions will improve your productivity when you create
reports based on OLAP data sources because you’ll frequently encounter changes in require-
ments as the project progresses. That is, your customers will frequently ask questions such as,
“Can you now summarize by x factor and y factor?”

In addition to using the Tablix functionality, you can show or hide grouping levels based on
other factors by configuring the grouping-level properties. You can do this in a number of
ways. One way is to access the Visibility property of the particular group (for example, row
or column) and then configure Visibility to be one of the following values: Show, Hide, or
Show Or Hide Based On An Expression. You may also select the Display Can Be Toggled By
This Report Item option, if you wish to allow the users to expand the amount of detail on the
report. This is a common requirement because report summary levels are often determined
by the level of detail that a particular level of management wants to view. For example, it’s
common for senior managers to want to view summary data, whereas middle managers
often prefer to drill down to the level or levels that they are most closely associated with.

You can also edit the displayed information in the Row Groups and Column Groups sections,
new in SQL Server 2008, which appear at the bottom of the report designer. This is yet
another place where you can add grouping levels to your report, edit them, or delete them.
We encourage you to explore the newly added shortcut menus in the Row Groups and
Column Groups sections. Menu options include adding, editing, and deleting groupings on
rows and on columns. Additionally, if you decide to add a new grouping to your report via
these shortcut menus, you can also select where you'd like to add it. The options are Parent
Group, Child Group, Adjacent Before, and Adjacent After. In general, we really like the
usability enhancements included in the SSRS report designer in BIDS and believe that if you
master the UI, you'll improve your report-writing productivity.

Because we are discussing changes and improvements to SSRS controls, we’ll mention that
in addition to the new Gauge control, another significant improvement in SSRS 2008 is the
overhaul of the Chart control. You can now choose from a greater variety of chart types, and
you have more control over the properties of the chart, which include more features to use
to better present data visually. Along with standard chart types—that is, column charts and
pie charts—there are many new chart types you can use in your reports. Here is a partial list
of the new chart types in SSRS 2008: stepped line, range, exploded pie, polar, radar, range
column/bar, funnel, pyramid, and boxplot.

Using Report Builder


Report Builder is a simplified report designer application. It was introduced in SQL Server
2005; however, it has been completely redesigned in SQL Server 2008. Report Builder was
released separately from SQL Server 2008, in October 2008. This text is based on Report
Builder version 2.0 RC1, which is available for download from http://www.microsoft.com/
downloads/details.aspx?FamilyID=9f783224-9871-4eea-b1d5-f3140a253db6&displaylang=en.
The released version’s features might differ slightly from the following discussion.

The first thing you’ll notice is that the design of the interface is quite similar to the design of
the report work area in BIDS. Among other UI changes, Microsoft has now included a ribbon-
like menu interface. You’ll also notice that the Data window is nearly identical to the Report
Data window in the report designer in BIDS. Also, at the bottom of the designer surface there
are sections for quick configuration of row groups and column groups, just like those that
have been added to the SSRS designer in BIDS.

In addition to what is immediately visible on the report designer surface, you’ll also find that
the properties dialog boxes have been redesigned from the previous version so that they
now match those found in the SSRS designer in BIDS. All of these UI changes result in better
productivity for report authors, whether they use BIDS or Report Builder. Figure 21-17
shows the opening UI of the redesigned Report Builder.

Figure 21-17 Report Builder user interface

Report Builder includes the MDX Query Designer that we saw earlier in BIDS when you con-
nect to an Analysis Services data source. For simplicity, we’ll create a nonparameterized
report using the Adventure Works sample cube to walk you through the steps of creating a
sample report.

Click the Table Or Matrix icon on the designer surface to open the New Table Or Matrix
Wizard. The first step is to create a connection (data source). Then define a query (dataset) to
provide the data that you want to display on your report. Figure 21-18 shows the visual MDX
Query Designer. As with the report designer in BIDS, you can either design MDX queries visu-
ally or click the last button on the right side of the toolbar to change to manual query mode.

After you’ve written your query using the Design A Query page of the New Table Or Matrix
Wizard, on the Arrange Fields page of the wizard you lay out your results in the Tablix data
region. The wizard interface is well designed and allows you to intuitively lay out fields onto
the rows or columns axis. In addition, you can change the default measure aggregation from
SUM to any of 14 of the most common aggregate functions—such as MIN, MAX, and so
on—by clicking on the downward pointing triangle next to the measure value. This interface
is shown in Figure 21-19.

Figure 21-18 Report Builder visual MDX Query Designer

Figure 21-19 The Arrange Fields page of the New Table Or Matrix Wizard

On the Choose The Layout page of the wizard, you configure selections related to displaying
subtotals or groups (with optional drilldown). Lastly, you can apply a predefined formatting
style to your report. As an example, we’ve applied a bit more formatting, such as selecting
text and marking it as bold, and we show you the results on the designer surface shown in
Figure 21-20.

If you don’t want to use the default layout, you can format any or all sections of the Tablix
control display by right-clicking on the section of interest and then configuring the associ-
ated properties. This behavior is nearly identical, by design, to the various formatting options
that we’ve already looked at when you use BIDS to format reports.

Figure 21-20 Report Builder with a chart-type report on the designer surface

Summary
In this chapter, we described how to use SSRS as a report interface for BI projects. We inves-
tigated the MDX visual and manual query interfaces built into BIDS. Then we examined the
redesigned development environment, including the new Report Data window. After that,
we built a couple of reports using BI data and the new or improved data region controls. We
talked about the Tablix data region and populated it with the results of an MDX OLAP cube
query. We continued by talking about the future of the Report Builder client.

In the next chapter, we’ll look at advanced topics in SSRS related to BI projects, including
implementing custom .NET code and using the new Microsoft Office Word and Excel 2007
report viewing and exporting capabilities. We’ll then describe how to embed the report
viewer controls in a custom Windows Forms application. We’ll wrap up our look at SSRS by
reviewing a sample of coding directly against the SSRS API and looking at best practices
related to performance and scalability of SSRS.
Chapter 22
Advanced SQL Server 2008
Reporting Services
In this chapter, we take a look at some advanced concepts related to using SQL Server
Reporting Services (SSRS) in the Business Intelligence Development Studio (BIDS) environ-
ment. This includes integrating custom code modules and property configurations. We also
examine the new functionality for viewing SSRS reports in Microsoft Office Word and Excel
2007. Then we take a look at integrating SSRS into custom hosting environments, including
Windows Forms and Web Forms.

We also look at URL access, embedding the report viewer controls, and directly working with
the SSRS Simple Object Access Protocol (SOAP) API. Coverage of these topics is followed by
a discussion of deployment, which includes scalability concerns. We conclude the chapter
with a look at some of the changes to memory architecture and the Windows Management
Instrumentation (WMI) API in SSRS.

Adding Custom Code to SSRS Reports


We open our chapter with an advanced topic—using custom .NET code in SSRS reports. An
example of a business scenario for which you might choose to write custom code is one
where you need complex, custom processing of input data, such as parsing XML documents
as input data for SSRS reports.

There are two approaches to doing this. Using the first approach, you simply type (or copy)
your code into the particular report of interest (in the Report Properties dialog box) and run
the .NET code as a script—that is, the code is interpreted each time the method or meth-
ods are called. The second approach, and the one we prefer, is to write (and debug) code in
Microsoft Visual Studio and then deploy the module to the global assembly cache (GAC) on
the report server. We use our preferred .NET language (C#) to write this custom code; how-
ever, you can use any .NET language as long as it will be compiled into a DLL. You are simply
creating business logic that can be reused by SSRS, so you will normally use the class file tem-
plate in Visual Studio to get started. You then write, debug, and optimize your class file or files.

After you've written and built your class file, you then edit the SSRS configuration file
(rssrvpolicy.config) to add a reference to that DLL. Then reference the particular assembly from the
Report Properties dialog box mentioned earlier (and shown in Figure 22-1). Because you’re
now working with a compiled assembly, you gain all the advantages of working with com-
piled code—that is, performance, security, and scalability. To implement the business logic
in the report, you then invoke one or more of the methods included in your class file from
your SSRS report logic.

Figure 22-1 SSRS Report Properties dialog box

To implement the functionality defined in the assembly, you access its members via a report
expression. Most of the time, you create your assembly’s members as static—that is, belong-
ing to the class or type, rather than being associated with a particular instance of that class.
This makes them simpler to call from SSRS because they are global and need not be instanti-
ated on each use. The syntax for calling a static member is =Namespace.Class.Method. If you
are calling a particular method instance, the syntax in SSRS is =Code.InstanceName.Method.
Code is a keyword in SSRS. If you need to perform custom initialization of the object, you
override the OnInit method of the Code object for your report. In this method, you create an
instance of the class using a constructor, or you call an initialization method on the class to
set any specific values. You do this if you need to set specific values to be used on all subse-
quent calls to the object, such as setting one or more members to default values for particu-
lar conditions. (For example, if the report is being executed on a Saturday or Sunday, set the
Weekday property to False.)
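
To make this concrete, here is a minimal sketch of the kind of class you might compile into a
custom assembly. The namespace, class, and method names (ReportUtilities, ReportHelpers, and
IsWeekend) are illustrative choices of ours, not names defined by SSRS. After the assembly is
referenced by the report, a static member like this one can be called from a report expression
such as =ReportUtilities.ReportHelpers.IsWeekend(Fields!OrderDate.Value), where OrderDate is
whatever date field your dataset actually contains.

using System;

namespace ReportUtilities
{
    public static class ReportHelpers
    {
        // Returns True when the supplied date falls on a Saturday or Sunday.
        // Because the member is static, SSRS can call it without creating
        // an instance of the class.
        public static bool IsWeekend(DateTime date)
        {
            return date.DayOfWeek == DayOfWeek.Saturday ||
                   date.DayOfWeek == DayOfWeek.Sunday;
        }
    }
}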

The SSRS samples available on CodePlex demonstrate several possible uses of this type of
code extension to SSRS. These include custom renderers, such as printers, authentication,
and more. These samples can be downloaded from the following location:
http://www.codeplex.com/MSFTRSProdSamples.

Note that if you’re working with a custom assembly, you must reference both the namespace
and the class or instance name in your report. Another consideration when deploying custom
code as an assembly is the appropriate use of code access security (CAS). This is because
assemblies can perform operations outside of the application boundary, such as requesting
data from an external source, which could be a database, the file system, and so on. CAS is
a policy-based set of code execution permissions that is used in SSRS. CAS default permis-
sions are set in the multiple *.config files used by SSRS, such as rssrvpolicy.config. For more
information, see the SQL Server Books Online topic “Understanding Code Access Security in
Reporting Services.”

Custom code modules are typically used to provide the following types of functionality: cus-
tom security implementation, complex rendering, and data processing. The most common
use of this type of functionality that we've provided for our clients revolves around custom data
processing (which occurs prior to report rendering). Specifically, we’ve implemented complex
if…then…else or case statements that transform data prior to rendering it using custom code
modules in SSRS.

Another business scenario that has caused us to implement custom report extensions is one
in which the customer wants to create an “export to printer” mode of report rendering. This
new rendering extension, after being properly coded and associated with your SSRS instance,
appears as an additional choice for rendering in all client drop-down list boxes, such as in the
SSRS Report Manager Web site. CodePlex has a well-documented sample of this functionality
to get you started if you have a similar requirement. You can find this sample at http://www.
codeplex.com/MSFTRSProdSamples.

After you’ve completed report development, your next consideration is where you’d like
to host and display your reports for end users to be able to access them. As we’ve men-
tioned, the default hosting environment is the Report Manager Web site provided with SSRS.
Although a couple of our clients have elected to use this option, the majority have preferred
an alternate hosting environment. We’ve found that the primary drivers of host environment
selection are security model, richness of output, and sophistication of end-user groups.

Another consideration is whether other Microsoft products that can host business intel-
ligence information have already been deployed, such as Excel or Office SharePoint Server
2007. In Chapter 23, “Using Microsoft Office Excel 2007 as an OLAP Cube Client,” we’ll cover
Excel hosting in detail. In Chapter 25, “SQL Server Business Intelligence and Microsoft Office
SharePoint Server 2007,” we’ll discuss Office SharePoint Server 2007 (Report Center) hosting.
In the next few sections of this chapter, we’ll take a look at some alternatives for hosting to
those that we’ve just listed. We’ll start by exploring direct Office 2007 viewing.

Viewing Reports in Word or Excel 2007


The ability to render SSRS reports in Office file formats has been significantly enhanced
in SSRS 2008. In SSRS 2005, Excel was the only Microsoft Office format available for rendering.
In SSRS 2008, you can also render in a Word (2000 or later) format, and the Excel rendering
has been enhanced. The
simplest way to use the new Word rendering is to view a report in the default Web site, select
the Word format from the list of included rendering options, and then click the Export link to
save the report in a .doc format. This is shown in Figure 22-2.

Figure 22-2 Render formats in SSRS 2008 now include Word.
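
If you need to produce the same Word output without going through the UI, the URL access
syntax described later in this chapter works as well; assuming the Word renderer is registered
under its default name of WORD, a request would look similar to the following:

http://servername/reportserver?/SampleReports/Employee Sales Summary&rs:Command=Render&rs:Format=WORD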

In addition to the inclusion of Word rendering, SSRS 2008 has improved existing renderers
for Excel and CSV. Excel rendering with SSRS 2008 now supports nested data regions such as
subreports. In SSRS 2005, a matrix rendered to a CSV file produced a cluttered format that
was difficult to use. The output for this format has been cleaned up and simplified in 2008,
and it is now easier to import into applications that support CSV files.

If you want to further customize the SSRS output that is rendered to either a Word or an
Excel format, you can use the Visual Studio Tools for Office (VSTO) template in Visual Studio
2008 to programmatically extend BI reports that are rendered in either of the supported
formats.

For more information (including code samples) about custom VSTO development, see the following
link: http://code.msdn.microsoft.com/VSTO3MSI/Release/ProjectReleases.aspx?ReleaseId=729. VSTO
is discussed in more depth in Chapter 23.

You can choose to create your own Web site to host BI reports. As mentioned, there are a
couple of approaches to take here. The simplest is to simply use URLs to access the reports.
We’ll talk about that next.

URL Access
Because of the simplicity of implementation, many of our customers choose to host BI
reports using SSRS and link each report’s unique URL to an existing Web site. Microsoft has
made a couple of useful enhancements to the arguments you can pass on the URL. Before we
cover that, let’s take a look at a sample URL so that you can understand the URL syntax used
to fetch and render a report:

http://servername/reportserver?/SampleReports/Employee Sales Summary&rs:Command=Render&rs:format=HTML4.0

This example fetches the information about the Employee Sales Summary report and ren-
ders it to an HTML 4.0 output type. For a complete description of the URL syntax, see the
SQL Server Books Online topic "URL Access" at http://msdn.microsoft.com/en-us/library/ms153586.aspx.

Although it’s typical to use URL access for SSRS BI reports from Web applications, it’s pos-
sible to use this method from custom Windows Forms applications. The latter approach still
requires the use of a Web browser (usually embedded as a WebBrowser control on the form).
The SQL Server Books Online topic “Using URL Access in a Windows Application,” which is
available at http://msdn.microsoft.com/en-us/library/ms154537.aspx, details exactly how to
do this.
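
As a brief illustration of that approach, the following sketch assumes a Windows Form that
contains a WebBrowser control named webBrowser1; it simply builds a URL using the syntax
shown earlier and navigates the embedded browser to it (substitute your own server name
and report path):

// Build a URL access request for a server report and display it in an
// embedded WebBrowser control on the form.
string reportPath = "/SampleReports/Employee Sales Summary";
string url = "http://servername/reportserver?" + Uri.EscapeUriString(reportPath) +
    "&rs:Command=Render&rs:Format=HTML4.0";
webBrowser1.Navigate(url);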

New to SQL Server 2008 is the ability to work with estimated report total page counts
through URL access. This functionality has been added because page rendering has changed
in this version of SSRS and it’s not necessarily possible to know the full range of page
numbers as the report is processed. By using URL access, you can provide the argument
&rs:PageCountMode=Estimate to use an estimated page count or &rs:PageCountMode=Actual
to use the actual page count. Of course, more overhead is associated with the Actual value
because all pages must be rendered in order to obtain a final page count. The SSRS inter-
face reflects this change in the UI by adding a question mark after the page count in SSRS as
shown in Figure 22-3. If the page count is an actual count, the question mark will not appear
in the SSRS interface. The URL syntax needed if you want to use an estimated page count is
as follows: http://servername/reportserver?/Adventure Works Sales/Sales Person Directory
&rs:PageCountMode=Estimate.

Figure 22-3 Estimated pages in SSRS interface



Another consideration when implementing URL access is how you choose to handle security
credentials. You have a couple of options to address this. Most of our clients have chosen to
implement Windows authentication for their intranet scenarios and to configure the included
SSRS top-level URL as part of the intranet security zone so that the credentials of the logged-
on user will be automatically passed to SSRS.

If you’re implementing URL access in an Internet scenario (or a scenario where the user
needs to provide logon credentials), you should set the SSRS data source to prompt for
credentials. You can then include the credentials as part of the URL by including the
prefix:datasourcename=value parameter string, where prefix is either dsu (for user name) or
dsp (for password) and datasourcename is the name of the data source for which to supply
credentials. For example, if you want to connect to the AdventureWorks2008 data source
with a user name of testuser and a password of password, you include the following values in
the URL: dsu:AdventureWorks2008=testuser&dsp:AdventureWorks2008=password.

Of course, you should be using Secure Sockets Layer (SSL) encryption in these scenarios so
that the transmission is encrypted. The user name and password should have the minimum
privileges required to get the data for the report, such as read-only access to the database.
Alternatively, you could supply the credentials via other methods, such as programmatically,
if the security risk of passing credentials on the URL is unacceptable.

Another way to implement URL access is to use the Microsoft ReportViewer controls in either
a Web Forms or Windows Forms application to be able to display a BI report that is hosted in
SSRS. We’ll detail the process to do this in the next section.

Embedding Custom ReportViewer Controls


Microsoft provides two controls in Visual Studio 2008 that allow you to embed SSRS reports
(or link to an existing SSRS report hosted on an SSRS instance) in your custom Windows
Forms or Web Forms applications. Alternatively, you can also design some types of reports
from within Visual Studio and then host them in your custom applications. In Visual Studio
2005, Microsoft provided two different tools in the Toolbox to represent the types of this
control; in Visual Studio 2008, the different modes of report access have been incorporated
into a single control in the Toolbox for each type of client application. These controls appear
in the Reporting section of the Toolbox in Visual Studio 2008 when you use either the
ASP.NET Web Application or the Windows Forms Application project template.

Note If you plan to use the SSRS ReportViewer control, you can choose to install both Visual
Studio 2008 and SQL Server 2008 on the same physical machine. Also, a full version of Visual
Studio (not the Express edition) is required. If you install both on the same machine, you need to
install Service Pack 1 (SP1) for Visual Studio 2008 to ensure compatibility between BIDS and Visual Studio 2008.

The two report processing modes that this control supports are remote processing mode
and local processing mode. Remote processing mode allows you to include a reference to
a report that has already been deployed to a report server instance. In remote process-
ing mode, the ReportViewer control encapsulates the URL access method we covered in
the previous section. It uses the SSRS Web service to communicate with the report server.
Referencing deployed reports is preferred for BI solutions because the overhead of rendering
and processing the often large BI reports is handled by the SSRS server instance or instances.
Also, you can choose to scale report hosting to multiple SSRS servers if scaling is needed for
your solution. Another advantage to this mode is that all installed rendering and data exten-
sions are available to be used by the referenced report.

Local processing mode allows you to run a report from a computer that does not have SSRS
installed on it. Local reports are defined differently within Visual Studio itself, using a visual
design interface that looks much like the one in BIDS for SSRS. The output file is in a slightly
different format for these reports if they’re created locally in Visual Studio. It’s an *.rdlc file
rather than an *.rdl file, which is created when using a Report Server Project template in BIDS.
The *.rdlc file is defined as an embedded resource in the Visual Studio project. When display-
ing *.rdlc files to a user, data retrieval and processing is handled by the hosting application,
and the report rendering (translating it to an output format such as HTML or PDF) is handled
by the ReportViewer control. No server-based instance of SSRS is involved, which makes it
very useful when you need to deploy reports to users that are only occasionally connected
to the network and thus wouldn’t have regular access to the SSRS server. Only PDF, Excel, and
image-rendering extensions are supported in local processing mode.

If you use local processing mode with some relational data as your data source, a new report
design area opens up. As mentioned, the metadata file generated has the *.rdlc extension.
When working in local processing mode in Visual Studio 2008, you’re limited to working with
the old-style data containers—that is, table, matrix, or list. The new combined-style Tablix
container is not available in this report design mode in Visual Studio 2008.

Both versions of this control include a smart tag that helps you to configure the associated
required properties for each of the usage modes. Also, the ReportViewer control is freely
redistributable, which is useful if you’re considering using either version as part of a commer-
cial application.

There are several other new ReportViewer control features in Visual Studio 2008. These
include new features for both the design-time and run-time environments. Design-time fea-
tures include the new Reports Application project template type. This is a project template
that starts the Report Wizard (the same one used in BIDS to create *.rdlc files) after you first
open the project. The wizard steps you through selecting a data source, choosing a report
type (tabular or matrix), defining a layout, and formatting the report. Also, the SSRS expres-
sion editor (with IntelliSense) is included. Local reports that include expressions created in
Visual Studio 2008 can include expressions written in Visual Basic .NET only. At runtime,
PDF compression is added (exporting to PDF now automatically compresses the report).

Using the ReportViewer control in a custom application adds two namespace references to
your project: Microsoft.ReportViewer.Common and Microsoft.ReportViewer.WinForms (or Web
Forms for Web applications). Because you use the ReportViewer control in local mode with
a Windows Forms application in scenarios where you want to design the report at the same
time you’re creating the form, we see the ReportViewer control in local mode being used
more often in OLTP reporting than in BI reporting. We believe that most BI developers will
create their BI reports first in BIDS and then use the ReportViewer control in a custom appli-
cation to provide access (using remote processing mode) to that report.

For this example, you’ll create a simple Windows Forms application to display a sample
SSAS cube report. After you open Visual Studio 2008 and start to design a Windows Forms
application by clicking File, New Project, C# (or Visual Basic .NET), and then Windows Form
Application, drag an instance of the MicrosoftReportViewer control from the Reporting sec-
tion of the Toolbox onto the form designer surface. The Toolbox, ReportViewer control, and
some of the control’s properties are shown in Figure 22-4.

Figure 22-4 ReportViewer control for a Windows Forms application

After you drag the ReportViewer onto the form’s designer surface, you’ll see that a smart tag
pops up at the upper right side of the control. This smart tag allows you to enter the URL
to an existing report (remote processing mode) or to design a new report (local process-
ing mode) by clicking the Design A New Report link on the smart tag. As mentioned, if you
use local processing mode, no license for SQL Server Reporting Services is needed and all
processing is done locally (that is, on the client). Unfortunately, this mode does not support
SSAS cubes as a data source. It does, of course, support using SQL Server data (and other
types of relational data) as data sources. There are other significant limitations when using
the local processing mode, including the following:

■■ Report parameters defined in the *.rdlc file do not map to query parameters automati-
cally. You must write code to associate the two.
■■ *.rdlc files do not include embedded data source (connection) or query information.
You must write that code yourself, as the sketch following this list illustrates.
■■ Browser-based printing via the RSClientPrint ActiveX control is not part of client-run
reports.
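
The second of these limitations means that the hosting application must supply the data itself.
The following sketch, for a Windows Forms application, assumes a ReportViewer control named
reportViewer1, an embedded report file named SalesSummary.rdlc in a project whose default
namespace is MyApp, and a DataTable named salesTable that you have already filled; the
dataset name SalesDataSet must match the dataset name defined inside the *.rdlc file. These
names are illustrative only.

using Microsoft.Reporting.WinForms;

// Local processing mode: the hosting application retrieves the data and
// hands it to the control, which renders the embedded *.rdlc report.
reportViewer1.ProcessingMode = ProcessingMode.Local;
reportViewer1.LocalReport.ReportEmbeddedResource = "MyApp.SalesSummary.rdlc";
reportViewer1.LocalReport.DataSources.Clear();
reportViewer1.LocalReport.DataSources.Add(
    new ReportDataSource("SalesDataSet", salesTable));
reportViewer1.RefreshReport();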

Tip You can connect the ReportViewer control to an Object data source programmatically. In
this way, you can connect to any BI object. A well-written example of this technique, including
code samples, is found in Darren Herbold’s blog at http://pragmaticworks.com/community/blogs/
darrenherbold/archive/2007/10/21/using-the-winform-report-viewer-control-with-an-object-data-source.aspx.

Because you'll most likely use only the remote processing mode to display reports built on
your SSAS cubes and mining structures, your considerations when using the Windows Forms
ReportViewer control will be the following settings (when configured using the smart tag
associated with the ReportViewer control named ReportViewer Tasks, as shown in Figure
22-5, described here, and illustrated in code after the figure):

■■ Choose Report Here you either select <Server Report> for remote processing mode
or leave this value blank for local processing mode. Selecting <Server Report> changes
the values in the smart tag to those in the remainder of this list. If you leave this value
blank and then click on the Design A New Report link, the included report designer in
Visual Studio opens a blank *.rdlc designer surface.
■■ Report Server Url This string is configured in the smart tag on the ReportViewer con-
trol and is in the form of http://localhost/reportserver.
■■ Report Path This string is configured in the smart tag on the ReportViewer con-
trol and is in the form of /report folder/report name—for example, /AdventureWorks
Sample Reports/Company Sales. Be sure to remember to start the path string with a
forward slash character (/). Also, you cannot include report parameters in this string.
■■ Dock In Parent Container This is an optional switch available via a linked string in the
smart tag for the control. It causes the ReportViewer control to expand to fill its current
container (the form in this example).

Figure 22-5 ReportViewer Tasks settings
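
If you prefer to configure these settings in code rather than through the smart tag, the same
values are exposed as properties of the control. The following sketch assumes a ReportViewer
control named reportViewer1 and reuses the sample server URL and report path from the
preceding list:

using System;
using System.Windows.Forms;
using Microsoft.Reporting.WinForms;

// Remote processing mode: the SSRS instance processes and renders the report.
reportViewer1.ProcessingMode = ProcessingMode.Remote;
reportViewer1.ServerReport.ReportServerUrl = new Uri("http://localhost/reportserver");
reportViewer1.ServerReport.ReportPath = "/AdventureWorks Sample Reports/Company Sales";
reportViewer1.Dock = DockStyle.Fill;  // the code equivalent of Dock In Parent Container
reportViewer1.RefreshReport();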

Figure 22-6 shows the ReportViewer control used in a Windows Forms application, displaying
a simple sample report. We’ve chosen only to return the Calendar Year labels and Internet
Sales Amount totals from the Adventure Works DW sample database in this sample
report.

By setting the optional Dock In Parent Container option to True, the report surface fills the
entire Windows Forms display area.

Figure 22-6 Rendered report using the ReportViewer control

About Report Parameters


If you’re connecting to a report in remote processing mode and the report expects param-
eter values, the ReportViewer control header area can provide a UI automatically for entering
or selecting the particular parameter value. If the ShowParameterPrompts property is set to
True, the prompt is displayed in the top area of the control. You have the option of setting
the ShowParameterPrompts property to False and handling the parameter entry yourself.
To do this, you must provide a parameter input area in the form or Web page. That is, you
must add some type of control—such as a text box, drop-down list, and so on—to display
and to allow the end users to select the parameter input values. You can then pass the value
from the control as a parameter to the report by using the SetParameters method that the
ReportViewer control exposes through the ServerReport class. You can also use this technique
to set SSRS parameters that are marked as hidden (in the Report Parameters dialog box). You
can see a code sample at the following blog entry: http://pragmaticworks.com/community/
blogs/darrenherbold/archive/2007/11/03/using-the-reportviewer-control-in-a-webform-with-parameters.aspx.
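
For example, in remote processing mode you might hide the built-in prompt area and pass a
value collected from your own drop-down list to the server report. The following sketch
assumes the report defines a parameter named ProductCategory and that the ReportViewer
control is named reportViewer1:

using Microsoft.Reporting.WinForms;

// Supply the parameter value yourself instead of using the built-in prompt area.
reportViewer1.ShowParameterPrompts = false;
reportViewer1.ServerReport.SetParameters(new ReportParameter[]
{
    new ReportParameter("ProductCategory", "Bikes")
});
reportViewer1.RefreshReport();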

The only way to display parameters when using local processing mode is programmatically.
You use the same method just described: add the appropriate controls to the form to allow
the user to select the parameter values. The difference is that you use the SetParameters
method of the LocalReport class to apply them to the report.

Tip As an alternative to linking to a report hosted on the SSRS server for which you’ve added
parameters at design time, you can programmatically populate parameters. The following blog
entry has a well-written description of this technique: http://blogs.msdn.com/azazr/archive/2008/
08/15/parameterize-the-olap-reports-in-client-applications.aspx.

About Security Credentials


The report credentials being passed through the ReportViewer control vary depending on
how the application developer has configured the application and the type of application
(that is, Windows Forms or Web Forms). For all types of credentials, remember to verify that
the credentials being used in the application have access to the desired report on the SQL
report server and have appropriate permissions in the database the information is being
retrieved from, whether it’s a relational source or an Analysis Services database. If you’re
working with Web Forms, the default authentication type is Windows Integrated Security.
Remember from Chapter 5, “Logical OLAP Design Concepts for Architects,” that there are
also limits on how many times a user security token can be passed to other computers.

If you need to support a custom authentication provider other than Windows, you’ll need to
support those requirements programmatically. Remember to consider both authentication
and authorization strategies related to the type of authentication provider, such as provid-
ing a logon page for the user to enter credentials, and so on. Note that the ReportViewer
control does not provide pages for prompted credentials. If your application connects to a
report server that uses custom authentication (that is, one that is forms based), as mentioned,
you must create the logon page for your application.

If you’re implementing the ReportViewer control for an ASP.NET application in remote pro-
cessing mode, you might want to configure connection information in the web.config file for
the application. Specifically, you can optionally configure a ReportViewerServerConnection
key value to store connection information in the event you’ve implemented your Web appli-
cation with session state storage turned off. For more information, see the SQL Server Books
Online topic "Web.config Settings for ReportViewer" at http://msdn.microsoft.com/en-us/library/ms251661.aspx.

Note Projects using the Visual Studio 2005 ReportViewer control are not automatically
upgraded. You must manually change references in your project to use the new control.

We have encountered business requirements that necessitate that we code directly against
the SSRS API. We’ll talk about the why’s and how’s of that scenario next.

About the SOAP API


If you’re a developer, you might be saying to yourself, “Finally, I get to write some code here.”
As with most other aspects of BI solutions, there is a good reason that we’ve placed this sec-
tion near the end of the chapter. We’ve found many overengineered solutions among our
clients. We again caution that writing code should have a solid business justification because
it adds time, cost, and complexity to your reporting solution. That being said, what are some
of the most compelling business scenarios in which to do this?

We’ll use some of our real-life examples to explain. Although for small-to-midsized clients,
using URL access or the ReportViewer control has worked well for us, we’ve had enterprise
clients for which these solutions proved to be feature deficient. Specifically, we’ve imple-
mented direct calls to the SSRS Web service API for the following reasons:

■■ Custom security implementation Most often, this situation includes very specific
access logging requirements.
■■ A large number of custom subscriptions Sometimes this situation includes advanced
property configurations, such as snapshot schedules, caching, and so on.
■■ A large number of reports being used in the environment We’ve achieved more
efficient administration at large scales by coding directly and implementing the client’s
specific requirements—for example, report execution schedules.
■■ Complex custom rendering scenarios This last situation requires quite a bit of cus-
tom coding such as overriding the Render method. Although this solution is powerful
because it gives you complete control, it’s also labor intensive because you lose all the
built-in viewing functionality, such as the report toolbar.

If you’re planning to work directly with the API, you’ll be working with one of two catego-
ries (management or execution) of Web service endpoints. The management functionality is
exposed through the ReportService2005 and ReportService2006 endpoints.

The ReportService2005 endpoint is used for managing a report server that is configured in
native mode, and the ReportService2006 endpoint is used for managing a report server that
is configured for SharePoint integrated mode. (We’ll discuss SharePoint integrated mode in
Chapter 25.) The execution functionality is exposed through the ReportExecution2005 end-
point, and it’s used when the report server is configured in native or SharePoint integrated
mode. As with all Web service development, you must know how to access the service, what
operations the service supports, what parameters the service expects, and what the service
returns. SSRS provides you with a Web Service Description Language (WSDL) file, which pro-
vides this information in an XML format. If you prefer to consult the documentation first, you
can read about the publicly exposed methods in the SQL Server Books Online topic “Report
Server Web Service Methods” at http://msdn.microsoft.com/en-us/library/ms155071.aspx.
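
For example, assuming you've added a Web reference to the ReportService2005 endpoint (the proxy class name and the folder path shown here depend on how you generate the reference and on your own folder structure), listing the items in a folder takes only a few lines of code:

// Proxy generated from http://<server>/ReportServer/ReportService2005.asmx?wsdl.
ReportingService2005 rs = new ReportingService2005();
rs.Credentials = System.Net.CredentialCache.DefaultCredentials; // credentials are discussed next

// List the items in a folder without recursing into subfolders.
CatalogItem[] items = rs.ListChildren("/AdventureWorks Sample Reports", false);
foreach (CatalogItem item in items)
{
    Console.WriteLine("{0} ({1})", item.Path, item.Type);
}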

As with other custom clients, you must consider security requirements when implementing
your solution. The API supports Windows or basic credentials by default. The syntax for pass-
ing Windows credentials via an SSRS proxy is as follows:

ReportingService rs = new ReportingService();
rs.Credentials = System.Net.CredentialCache.DefaultCredentials;

To pass basic credentials, you use this syntax:

ReportingService rs = new ReportingService();
rs.Credentials = new System.Net.NetworkCredential("username", "password", "domain");

You could, of course, also implement custom security by writing a custom authentication
extension for SSRS. For an example, see the SQL Server Books Online topic “Implementing a
Security Extension” at http://msdn.microsoft.com/en-us/library/ms155029.aspx.

An additional security consideration is the ability to require SSL connections for selected Web
methods. You can configure these requirements by setting the appropriate value (0 through
3) in the SecureConnectionLevel setting in the RSReportServer.config file. As you increase the
value of this setting, more Web methods are required to be called over a secure connection.
For example, you might set the value to 0, like this:

<Add Key="SecureConnectionLevel" Value="0"/>

This means that all Web service methods can be called from non-SSL connections. On the
other hand, setting SecureConnectionLevel to 3 means that all Web service methods must be
invoked over an SSL connection.

There is an interesting new item in the SQL Server 2008 SSRS ReportExecution2005 Web
service that corresponds to the estimated page count that we saw earlier in the URL access
section. In SSRS 2005, there was a class named ExecutionInfo that contained a NumPages
property, which could be used to retrieve the actual page count. In SSRS 2008, the
ExecutionInfo class has been extended as ExecutionInfo2. The new class has an additional
property, PageCountMode, which can be set to Actual or Estimate to control the estimating
behavior. The Web service now includes several extended classes and methods to support
the new functionality in SSRS 2008, including ExecutionInfo2, LoadReport2, ResetExecution2,
Render2, and more.

What Happened to Report Models?


SSRS 2005 introduced the ability to create a semantic layer between your database query
and report. This was expressed as the report model. You can create report models by using
a template in BIDS (for relational sources), or you can create them automatically by click-
ing a button on the data source configuration Web page in the default Web site (for OLAP
sources). Although this functionality is still included in SSRS 2008, we’ll be using report mod-
els less often in SSRS 2008. This is because the original reason to create these semantic mod-
els was that they were required to be used as sources for designing reports using the Report
Builder tool.

In the RTM release of Report Builder 2.0 for SSRS 2008, these semantic models were no lon-
ger required as a source for report creation. Report Builder 2.0 allows you to use report mod-
els or a direct connection to your data (whether relational or OLAP) as a source for reports.
Also, using Report Builder 2.0 you can now create .rdl-based reports using a drag-and-drop
interface that is similar to the report design interface found in BIDS for SSRS.

Figure 22-7 shows a report model built from the OLTP AdventureWorksLT database.

Figure 22-7 Report Model tab in BIDS



After you deploy a report model to the report server, you can configure associated proper-
ties in the default Web site. New to SSRS 2008 is the ability to configure clickthrough reports,
item-level security, or both in the default Web site. This is shown in Figure 22-8.

Figure 22-8 Report model clickthrough permissions in BIDS

After you deploy your semantic model, you can use it as a basis for building a report using
the SSRS old-style Report Builder interface, shown in Figure 22-9, or you can use it as a
source in the new Report Builder 2.0.

Figure 22-9 Report Builder 1.0 interface using a report model as a data source

To better understand how Report Designer, Report Builder, and report models work
together in SSRS 2008, we suggest you read Brian Welcker's blog at http://blogs.msdn.com/
bwelcker/archive/2007/12/11/transmissions-from-the-satellite-heart-what-s-up-with-
report-builder.aspx.

Figure 22-10 provides a conceptual view of the SSRS report-creation tools.

[Figure 22-10 diagram labels: Report Designer 2008 (Visual Studio Integration) and Report
Builder 2008 (Office 12 Look and Feel), sharing Full RDL Support, All Data Sources, Shared
Layout Surface, and Shared Dialog Boxes; Report Models; Integrated Query and Layout;
Clickthrough Link Generation; Report Builder 2005]

Figure 22-10 Conceptual view of SSRS report-creation tools

The key piece of information for BI is that semantic models are no longer required to build
reports using an SSAS database as a data source with the Report Builder 2008 tool. Report
models continue to be supported, but they’re not required.

Deployment—Scalability and Security


For some BI scenarios, you have enough information at this point to proceed with your
implementation. However, SSRS is designed for scalability, and we want to cover those
capabilities because some of our clients have had business requirements that called for the
included scalability features. The scale-out feature (Web farms) that we present here requires
the Enterprise edition of SSRS. For a complete list of features by edition, go to
http://msdn.microsoft.com/en-us/library/cc645993.aspx (reporting section).

Performance and Scalability


As the load increases on your SSRS instance because of larger reports or more end users
accessing those reports, you might want to take advantage of one or more of the scaling
features available in SSRS. These include caching, using snapshots, and scaling out the SSRS
server itself. The simplest way to configure caching or snapshots is on a per-report basis
using the management interface in the Report Manager Web site.

You might also want to use some of the included Windows Performance counters for
SSRS during the pilot phase of your project to test your server instance using produc-
tion levels of load. These counters are detailed in the SQL Server Books Online topics
“Performance Counters for the MSRS 2008 Web Service Performance Object” (http://
technet.microsoft.com/en-us/library/ms159650.aspx) and “Performance Counters for the MSRS
2008 Windows Service Performance Object” (http://technet.microsoft.com/en-us/library/
ms157314.aspx).

You can easily configure snapshot and timeout settings globally using the SSRS Site Settings
page (General section) as shown in Figure 22-11.

Figure 22-11 SSRS Site Settings dialog box

You can use the Report Manager Web site to configure caching or snapshots for an individual
report. Select the report you want to configure, and then click the Properties tab and choose
the Execution section. Here you can also configure an execution timeout value for potentially
long-running reports as shown in Figure 22-12. When you choose to schedule report execu-
tion to reduce load, it’s a best practice to include the built-in field value ExecutionTime on the
report so that your end users can be aware of potential data latency.

Figure 22-12 SSRS report execution settings

We do caution that if you find yourself needing to use the execution time-out value for a
report using an SSAS database as a data source, you might want to reevaluate the quality
of the MDX query. Be sure to include only the data that is actually needed in the result set.
Tracing with Profiler can help you to evaluate the query itself.

Another optimization technique is to add aggregations to the cube at the particular dimen-
sion levels that are referenced in the query. This is a scenario where you can choose to use
the Advanced view in the aggregation designer to create aggregations at specific inter-
sections in the cube. We discussed this technique in Chapter 9, “Processing Cubes and
Dimensions.”

New in SQL Server 2008 SSRS is the ability to configure a memory threshold for report pro-
cessing. In previous releases, the report server used all available memory. In this release, you
can configure a maximum limit on memory as well as interim thresholds that determine how
the report server responds to changes in memory pressure. You do this by making changes
to the default settings in the RSReportServer.config file. To help you understand these set-
tings, we’ll first cover some background and then detail the process to make changes.

Advanced Memory Management


Because of some significant changes in SSRS architecture, you now have the ability to imple-
ment more fine-grained control over memory usage in SSRS. To show you how this works,
we’ll first review how memory was allocated in SSRS 2005 and then compare that with the
changes in the SSRS 2008 memory architecture. Also, we caution that any changes to the
default configuration should be documented and tested with production levels of load.

To start, we'll review some challenges that occurred in SSRS 2005 because of the memory-
management architecture. Because report processing was bound by available memory,
scalability was limited, and large reports running interactively sometimes caused memory
exceptions. Larger reports could also starve smaller reports of resources. Also, page-to-page
navigation could result in increasing response times as the number of pages increased.

The object model computed calculations before storing reports in an intermediate format.
This sometimes resulted in problems such as inconsistent rendering results when paging for-
ward and backward, or inconsistent pagination across rendering outputs. The problem was
that all data had to be processed before you could start rendering. Each rendering extension
read from a report object model and performed its own pagination, and the output of the
rendering extensions was then consumed by the WebForms control (as HTML), or, via a
serialized format or the image renderer, by the WinForms control or the Print control.

In SSRS 2008, several architectural changes were implemented to resolve these issues. The
first change is that the grouping of data has been pulled out of the data regions, which
results in consistent grouping. As mentioned, all data region types have been replaced with
the Tablix type. The chart data region has been kept separate because of the additional
properties required for visualization.

The second change is that the results of processing the data regions are stored in intermedi-
ate format before the calculations are executed. This means those calculations can be done
on the fly because raw data is always available.

The third change is that the rendering object model is invoked by a rendering extension
for a specific page. In other words, expressions have to be calculated only for the particular
requested page. This is an iterative process based on the grouping or groupings in scope on
the page being viewed. The report object model is abstracted into three module types:

■■ Soft page layout Interactive rendering in which there is no concept of a page
■■ Data Outputs data directly
■■ Hard page layout PDF and image files always have the same pagination

The fourth change is that some rendering to the client (HTML or image) is offloaded. This
was changed because in previous versions, rendering was done on the server at the server's
resolution rather than the client's. If the resolution was higher on the client, inconsistent
results were produced. Offloading also improves performance by moving work off the
SSRS server.

Memory management allows larger reports to be successfully (but more slowly) processed—
in previous versions, those reports would sometimes consume all the available memory on
the computer and fail. The goal of manually tuning memory is to reduce out-of-memory
exceptions.

To tune memory manually, you make changes to the SSRS RSReportServer.config configu-
ration file. Microsoft recommends changing the configuration settings only if reports are
timing out before processing completes. Note that, by default, only MemorySafetyMargin
and MemoryThreshold are configured. You must manually add the other settings to the con-
figuration file. Table 22-1 summarizes the configuration values and the areas of SSRS memory
that configuration changes to these areas affect.

Table 22-1 SSRS Memory Configuration Settings


Setting Name Description
WorkingSetMaximum This value (expressed in kilobytes) controls the maximum amount of
memory the report server can use. By default, this is set to the amount
of available memory on the computer.
MemoryThreshold This value (expressed as a percentage of WorkingSetMaximum) defines
the boundary between a medium and high memory pressure scenario.
By default, it’s set to 90.
MemorySafetyMargin This value (expressed as a percentage of WorkingSetMaximum) defines
the boundary between a low and medium memory pressure scenario.
By default, it’s set to 80.
WorkingSetMinimum This value (expressed in kilobytes) controls the minimum amount of
memory the report server keeps reserved. By default, this is set to 60
percent of WorkingSetMaximum.
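
As a sketch only (verify element names, placement, and values against the SQL Server Books Online topic for RSReportServer.config before relying on them), these settings might appear in the configuration file like this:

<!-- Values are illustrative; WorkingSetMaximum and WorkingSetMinimum are in kilobytes. -->
<MemorySafetyMargin>80</MemorySafetyMargin>
<MemoryThreshold>90</MemoryThreshold>
<WorkingSetMaximum>4000000</WorkingSetMaximum>
<WorkingSetMinimum>2400000</WorkingSetMinimum>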

There are also several new performance counters available for monitoring service activity.
ASP.NET performance counters no longer detect report server events or activity.

In addition to tuning the SSRS instance as described in this section so far, you might also
elect to run SSRS on more than one physical machine. This is called scaling out, and we’ll talk
about this technique in the next section.

Scaling Out
The Enterprise edition of SSRS supports scaling out—that is, using more than one physi-
cal machine to support the particular SSRS solution that runs from a common database. To
implement a scaled-out solution, you use the Reporting Services Configuration Manager
tool (Scale-Out Deployment section). This is also called a Web farm. SSRS is not a cluster-
aware application; instead, you can use network load balancing (NLB) as part of your
scale-out deployment. For more information, see the SQL Server Books Online topic “How
to: Configure a Report Server Scale-Out Deployment (Reporting Services Configuration)” at
http://msdn.microsoft.com/en-us/library/ms159114.aspx.

You must also manage the encryption key across each instance by backing up the generated
encryption key for each instance using either the Reporting Services Configuration Manager
or the rskeymgmt.exe command-line utility included for scriptable key management. Figure
22-13 shows the Scale-Out Deployment interface.

Figure 22-13 SSRS Scale-out deployment
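
If you script the key backup with rskeymgmt.exe instead of using the Configuration Manager, the command might look like the following (the file path and password are placeholders):

rskeymgmt -e -f C:\Backups\rsdbkey.snk -p <strong password>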

A typical scaled-out SSRS implementation includes multiple physical servers. Some of these
servers distribute the front-end report rendering via a network load balancing type of sce-
nario. You can also add more physical servers to perform snapshots or caching in enterprise-
sized implementations. For more implementation details and strategies for scale-out deployments
for SSRS, see the following post: http://sqlcat.com/technicalnotes/archive/2008/10/21/
reporting-services-scale-out-deployment-best-practices.aspx.

Another topic of interest when managing your SSRS instance is using scripting to manage
SSRS administrative tasks. We’ll take a look at this in the next section.

Administrative Scripting
SSRS includes a scripting interface (rs.exe) that allows administrators to execute scripts written
in Visual Basic .NET from the command line as an alternative to using the management pages
in the Reports Web site. This tool is not supported when you’re using SharePoint integrated
mode with SSRS.
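
For example, running a deployment script from the command line might look like this (the script file name, server URL, and variable are placeholders for your own values):

rs.exe -i PublishSalesReports.rss -s http://localhost/reportserver -v targetFolder="Sales Reports"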

This tool includes many switches that allow you to further configure the script’s execu-
tion. In addition to built-in support for accessing many of the administrative Web service
methods, such as report deployment or management, there are also additional scripts
available on CodePlex. These scripts provide examples of automating other routine main-
tenance tasks, such as managing scheduled jobs and setting report server–level proper-
ties. For example, there are some custom scripts in CodePlex at http://www.codeplex.com/
MSFTRSProdSamples/Wiki/View.aspx?title=SS2008%21Script%20Samples%20%28Reporting
%20Services%29&referringTitle=Home. These include sample scripts that allow you to
programmatically add report item security, manage running SSRS jobs, and more. In addition
to using rs.exe, you can also create and execute administrative scripts using the WMI provider
for SSRS.

Using WMI
The SSRS Windows Management Instrumentation (WMI) provider supports WMI operations
that enable you to write scripts and code to modify settings of the report server and Report
Manager. These settings are contained in XML-based configuration files. Using WMI can
be a much more efficient way to make updates to these files, rather than manually editing
the XML.

For example, if you want to change whether integrated security is used when the report
server connects to the report server database, you create an instance of the MSReportServer_
ConfigurationSetting class and use the DatabaseIntegratedSecurity property of the report
server instance. The classes shown in the following list represent Reporting Services compo-
nents. The classes are defined in either the root\Microsoft\SqlServer\ReportServer\
<InstanceName>\v10 or the root\Microsoft\SqlServer\ReportServer\<InstanceName>\v10\Admin
namespace. Each of the classes supports read and write operations. Create operations are not
supported.

■■ MSReportServer_Instance class Provides basic information required for a client to
connect to an installed report server.
■■ MSReportServer_ConfigurationSetting class Represents the installation and run-time
parameters of a report server instance. These parameters are stored in the configura-
tion file for the report server.

As with writing scripts and executing them using the rs.exe utility, you can also use the SSRS
WMI provider to automate a number of administrative tasks, such as reviewing or modify-
ing SSRS instance properties, listing current configuration settings, and so on. The ability to
make these types of configuration changes programmatically is particularly valuable if you
need to apply the same settings across a scaled-out farm of SSRS servers or to make sure that
multiple environments are configured the same way. The Reporting Services Configuration
Manager and rsconfig.exe utility use the WMI provider.
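
As a hedged sketch of reading the DatabaseIntegratedSecurity property through the .NET System.Management classes (the instance name segment of the namespace path is a placeholder; substitute the name of your own SSRS instance):

using System;
using System.Management;

class ShowRSConfig
{
    static void Main()
    {
        // RS_MSSQLSERVER is a placeholder; use your own SSRS instance name here.
        ManagementScope scope = new ManagementScope(
            @"\\.\root\Microsoft\SqlServer\ReportServer\RS_MSSQLSERVER\v10\Admin");
        ObjectQuery query = new ObjectQuery(
            "SELECT * FROM MSReportServer_ConfigurationSetting");

        using (ManagementObjectSearcher searcher =
            new ManagementObjectSearcher(scope, query))
        {
            // Print whether the report server uses integrated security to reach its database.
            foreach (ManagementObject setting in searcher.Get())
            {
                Console.WriteLine("DatabaseIntegratedSecurity: {0}",
                    setting["DatabaseIntegratedSecurity"]);
            }
        }
    }
}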

Note In SQL Server 2008 SSRS, there are a couple of changes to the WMI API. These include
changing the WMI namespace to \root\Microsoft\SqlServer\ReportServer\<Instance>\v10, adding
properties that let you retrieve information about the SSRS edition and version, and removing
a couple of classes (such as MSReportManager_ConfigurationSetting and MSReportServer_
ConfigurationSettingForSharePoint). For more information about the SSRS WMI provider, see the SQL
Server Books Online entry at http://msdn2.microsoft.com/en-us/library/ms152836(SQL.100).aspx.

Summary
In this chapter, we looked at some advanced topics related to SSRS in SQL Server 2008. These
included adding custom .NET code to an SSRS report to improve performance for computa-
tionally intensive processes. We then looked at the new Word and improved Excel rendering
capabilities.

Next, we examined creating custom applications using the embeddable report controls avail-
able for both Windows Forms and Web Forms applications in .NET. We concluded the chap-
ter by discussing scalability, availability, and advanced memory management related to the
SSRS implementation for BI projects.
Chapter 23
Using Microsoft Excel 2007 as an
OLAP Cube Client
In this chapter, we look at the ins and outs of using Microsoft Office Excel 2007 as a user
client for Microsoft SQL Server 2008 Analysis Services OLAP cubes. We’ll take a look at the
functionality of the updated design of the PivotTable interface in Excel 2007. Although you
can use Excel 2003 as a client for OLAP cubes, we won’t cover that functionality here. We’ll
start by reviewing the installation process.

Using the Data Connection Wizard


As we introduced in Chapter 2, “Visualizing Business Intelligence Results,” you’ll need a
sample cube to work with to understand what you can and can’t do using an Excel 2007
PivotTable as a client interface to SQL Server 2008 SSAS OLAP cubes. We’ll use the Adventure
Works DW 2008 sample cube as the basis for our discussion in this chapter. In Chapter 2, we
detailed how and where to download and set up the sample OLAP cube. Remember that you
need to retrieve the sample files from www.CodePlex.com. If you haven’t done so already, set
up this sample prior to continuing on in this chapter (if you’d like to follow along by using
Excel 2007).

We’ll use Excel 2007 as a sample client for the duration of this chapter. You can use Excel
2003 as a client for SSAS 2008 OLAP cubes, but it does have a slightly different interface and
somewhat reduced functionality. For more detail on exactly what OLAP cube features are
supported when using Excel 2003, see SQL Server Books Online.

Excel 2007, like the entire 2007 Microsoft Office system, includes redesigned menus, which
are part of the Ribbon. To start, we’ll take a look at making the connection from Excel 2007
to our sample SSAS 2008 OLAP cube. To do this we’ll use the Data tab on the Ribbon (shown
in Figure 23-1). Note the two groups on this tab that relate to connection management: Get
External Data and Connections.

Figure 23-1 The Data tab on the Excel 2007 Ribbon


To demonstrate the functionality of the Data tab on the Ribbon, we’ll take you through
an example. To make a connection to the AdventureWorks sample cube, click From Other
Sources in the Get External Data group. In the drop-down list that opens, click From Analysis
Services. The Data Connection Wizard opens, as shown in Figure 23-2.

Figure 23-2 The Connect To Database Server page of the Data Connection Wizard

After typing the name of the SSAS instance to which you want to connect (localhost if you
are following along with the example), enter the login credentials that are appropriate for
your scenario. Remember that by default, only local administrators have permission to read
OLAP cube data. If you are configuring nonadministrative access, you first have to use SSMS
or BIDS to configure Windows role-based security (the preferred method of connecting).
Next, log in as the Windows user for whom you are creating the Excel-based connection.

On the next page of the Data Connection Wizard, shown in Figure 23-3, you are asked
to select the OLAP database name and then the structure to which you want to connect.
You can select only one object. Both regular cubes and Analysis Services perspectives are
supported as valid connection objects. You’ll recall from Chapter 8, “Refining Cubes and
Dimensions,” that an SSAS OLAP cube perspective is a named, defined subset of an existing
OLAP cube. Perspectives are often used to provide end users with simplified views of enter-
prise cubes. You’ll continue to use the AdventureWorks cube for this example.

On the last page of the Data Connection Wizard, shown in Figure 23-4, you can configure
additional properties, such as naming the connection. Notice the optional setting, which is
cleared by default, called Always Attempt To Use This File To Refresh Data. Remember that
Excel 2007 does not refresh the data retrieved from the OLAP cube automatically. Later, as
we take a look at the PivotTable settings, we’ll review how you can change this default to per-
form refreshes on demand.

Figure 23-3 The Select Database And Table page of the Data Connection Wizard

Figure 23-4 The Save Data Connection Files And Finish page of the Data Connection Wizard

You need to balance the overhead of network traffic and query execution on Analysis Services
against your users' needs for refreshed data. You should base your con-
figuration on business requirements, and you should also document the refresh rate for other
administrators and end users.

Working with the Import Data Dialog Box


After you click Finish on the last page of the Data Connection Wizard, Excel opens the Import
Data dialog box (shown in Figure 23-5), which determines how and where the PivotTable will
be placed in the workbook. Here you can select the view for the incoming data. Your choices
are PivotTable Report, PivotChart And PivotTable Report (which includes a PivotTable), or
Only Create Connection (which doesn’t create a PivotTable). Also, you’ll specify whether you
want to put the data on the existing, open worksheet page or on a new worksheet.

Figure 23-5 The Import Data dialog box

You could just click OK at this point and begin laying out your PivotTable. However, alterna-
tively you can click the Properties button to configure advanced connection information. If
you do that, a dialog box with two tabs opens. The Usage tab (shown in Figure 23-6) allows
you to configure many properties associated with this connection. As mentioned previ-
ously, an Excel PivotTable is set to never refresh data by default. To change this setting, either
enable timed refresh by selecting Refresh Every and then setting a value in minutes, or select
Refresh Data When Opening The File.

We call your attention to one other important setting in this dialog box—the OLAP Drill
Through setting, with a default of a maximum of 1,000 records. If your business require-
ments are such that you need to increase this number substantially, we recommend that you
test this under production load. Drillthrough is a memory-intensive operation that requires
adequate resources on both the client and the server.

After you’ve completed configuring any values you want to change in the Connection
Properties dialog box, click OK to return to the Import Data dialog box. Click OK in that dia-
log box. Excel will create a couple of new items to help you as you design the PivotTable.

Figure 23-6 The Usage tab of the Connection Properties dialog box

Understanding the PivotTable Interface


The Excel Ribbon gives you quick access to the most commonly performed tasks when work-
ing with a PivotTable—we’ll be exploring its functionality in more detail shortly. After you’ve
completed your connection to SSAS, the Ribbon displays the Options and Design tabs under
PivotTable Tools. You’ll generally use the Options tab (shown in Figure 23-7) first, so we’ll start
there.

Figure 23-7 The PivotTable Tools Options tab

As we continue through the new interface, you’ll notice two additional sections. The first is
a workspace that is highlighted on the Excel worksheet page itself, as shown in Figure 23-8.
The point of this redesign, as we mentioned in Chapter 2, is to make the process of working
with a PivotTable more intuitive for minimally trained end users.

Figure 23-8 The work area for creating a PivotTable on a worksheet

Another important component of the PivotTable workspace is the redesigned PivotTable
Field List. This list now has four possible display configurations. Figure 23-9 shows the default
layout (Field Section And Areas Section Stacked). The button in the upper right allows you to
switch to the most natural layout for you.

The first section, labeled Show Fields Related To, allows you to filter measures, dimensions,
and other objects (such as KPIs) by their association to OLAP cube measure groups. The sec-
ond section, below the filter, allows you to add items to the PivotTable surface by selecting
them. The items are ordered as follows:

■■ Measures Shown alphabetized in order of measure groups.
■■ KPIs Shown in associated folders, then by KPI. You can select individual aspects of
KPIs to be displayed—either value, status, trend, or goal—rather than the entire KPI.
■■ Dimensions Shown alphabetized. Within each dimension you can select defined hier-
archies, individual members of hierarchies, or individual levels to be displayed. You can
also select named sets to be displayed.

Figure 23-9 The PivotTable Field List

Click any item in the PivotTable Field List and it is added to the PivotTable workspace and to
one of the areas at the bottom of the PivotTable Field List. Measures and KPIs are automati-
cally added to the Values section of the latter. Non-measures, such as dimensional hierar-
chies, are added to either the Row Labels or Column Labels section of this list; by default,
they are placed on the rows axis of the PivotTable.

To pivot, or make a change to the PivotTable that you’ve designed, simply click and drag the
item you want to move from one axis to another on the designer surface. For example, you
can drag a dimensional hierarchy from the rows axis to the columns axis, or you can drag it
from the Row Labels section of the PivotTable Field List to the Column Labels section. If you
want to use a non-measure item as a report filter, you drag that item to the Report Filter sec-
tion of the field list. A filter icon then appears next to the item in the list of items. You can
also remove values from the PivotTable by dragging them out of any of the sections and
dropping them on the list of fields.

Creating a Sample PivotTable


Now that we’ve explored the interface, let’s work with a sample in a bit more detail. To do
this we’ve first set the PivotTable Field List filter to Internet Sales (measure group). This gives
us a more manageable list of items to work with. We’ll look at two measures, Internet Sales
Amount and Internet Order Quantity, by selecting them. Selecting these measures adds both
of them to our PivotTable display (work) area. It also adds both measures to the Values area.

Next we’ll put some dimensional information on both rows and columns, adding the
Customer Geography hierarchy to the rows axis and the Date.Calendar hierarchy to the col-
umns axis as shown in Figure 23-10. Of course, we can also add a filter. We’ll do that next.

Figure 23-10 The PivotTable Field List configured

Remember that if you want to filter one or more members of either dimension that we’ve
already added, you simply open the dimension listing either from the field list or from the
PivotTable itself and then clear the members that you want to remove from the display. After
you implement a filter, the drop-down triangle icon is replaced by a filter icon to provide
a visual indicator that the particular axis has been filtered. For our example we’ve removed
the Canada data member from the Customer Geography dimension by clearing the check
box next to that value (Canada) in the PivotTable Field List for the Customer Geography
hierarchy.

Now our PivotTable looks a bit more interesting, as shown in Figure 23-11. We show two
measures and two dimensions, with the Customer Geography dimension being filtered. Be
aware that your end users can perform permitted actions, such as drillthrough, by simply
right-clicking the cell containing data of interest. Excel 2007 supports not only drillthrough,
but also additional actions defined on the OLAP cube. Remember that OLAP cube actions
can be of several types, including regular, reporting, or drillthrough. Regular actions target
particular areas of the cube, such as dimensions, levels, and so on, and produce different
types of output, such as URLs and datasets.

Figure 23-11 A simple PivotTable

Adding a filter from another dimension is as simple as selecting the dimension of interest and
then dragging that item to the Report Filter section of the field list. As with other SSAS client
interfaces, this filter will appear at the top left of the PivotTable, as it does in the BIDS cube
browser. This filter allows you to “slice” the OLAP cube as needed.

Now that we’ve created a sample PivotTable, you can see that more buttons are active on
the PivotTable Tools Options and Design tabs. Using the Design tab (shown in Figure 23-12),
you can now format the PivotTable. You can apply predefined design styles and show or
hide subtotals, grand totals, empty cells, and more. Which items are enabled on the Ribbon
depends on where the focus is. For example, if a measure is selected, most of the grouping
options are disabled.

Figure 23-12 The PivotTable Tools Design tab

You might want to add a PivotChart to your workbook as well. Doing so is simple. You
use the PivotTable Tools Options tab to add a PivotChart. First, click any cell in the exist-
ing PivotTable and then click the PivotChart button. This opens the Insert Chart dialog box
(shown in Figure 23-13) from which you select the chart type and style.

Figure 23-13 The Insert Chart dialog box

As with your PivotTable, the resultant PivotChart includes a redesigned PivotChart Filter
Pane. You use this to view and manipulate the values from the cube that you’ve chosen to
include in your chart. An example of both a PivotChart and the new filter pane is shown in
Figure 23-14.

Figure 23-14 An Excel PivotChart based on an OLAP cube

After you’ve created a PivotChart, Excel adds a new set of tabs on the Ribbon, under
PivotChart Tools. These four tabs help you work with your PivotChart: Design (shown in
Figure 23-15), Layout, Format, and Analyze.

Figure 23-15 The PivotChart Tools Design tab

Offline OLAP
In all of our examples so far, we’ve been connected to an SSAS instance and we’ve retrieved
data from a particular OLAP cube via query execution by Excel. In some situations, the end
user might prefer to use Excel as a client for locally stored data that has been originally
sourced from an OLAP cube. This is particularly useful in scenarios where the user may be
at a remote location or travels extensively and thus cannot always have direct access to the
Analysis Services database. Excel includes a wizard that allows authorized end users to save a
local copy of the data retrieved from an OLAP cube. To use this functionality, click the OLAP
Tools button on the PivotTable Tools Options tab, and then click Offline OLAP. The Offline
OLAP Settings dialog box opens and lets you choose to work online or offline. On-Line OLAP
is selected by default, as shown in Figure 23-16. Click the Create Offline Data File button
to open the Offline OLAP Wizard, also called Create Cube File. This wizard will guide you
through the process of creating a local cube (*.cub) file.

Figure 23-16 The Offline OLAP Settings dialog box

The first page of the wizard explains the local cube creation process. On the second page of
the wizard, you are presented with a list of all dimensions from the OLAP cube. Dimension
members that are currently selected to be shown in the PivotTable on the workbook page
are shown as selected and appear in bold in the list. The dimensions and levels you choose
here control which dimensions and levels are available in the offline copy of the cube. An
example is shown in Figure 23-17.

Figure 23-17 Create Cube File – Step 2 Of 4

On the third page of the wizard, you are shown a summary of the dimension members for
your local cube. You can clear a complete object, which excludes all members of
that object, or you can remove individual members, levels, and so on from your selected
parent items. In Figure 23-18, our selection mirrors the earlier filter that we configured in
the main PivotTable. That is, we’ve included the Country dimension, but have filtered out the
Canada member from our view.

Figure 23-18 Create Cube File – Step 3 Of 4

On the last page of this wizard, you configure the path and file name of where you’d like
Excel to create and store the local cube file. The default file path is C:\Users\%username%\
Documents\%cubename%.cub. The file that is created is a local cube (.cub) file.

Note You will find that while most values are selected by default, a few dimensions may not be
part of your PivotTable at all (such as Destination Currency). This can cause the wizard to throw
an error when trying to save the offline cube, because Destination Currency has a many-to-many
relationship with Internet Sales, and all the appropriate intermediate values may not be selected
by default.

Excel OLAP Functions


Excel 2007 exposes a group of new functions that allow you to work with OLAP cube infor-
mation via Excel formulas. These functions are listed in Table 23-1.

Table 23-1 Excel OLAP Functions


Function Description
CUBEMEMBER(connection,member) Returns the member defined by
member_name
CUBEKPIMEMBER(connection,kpi_name,kpi_property) Returns the KPI property defined by
kpi_name
CUBEVALUE(connection,member1,member2, …) Returns the value of a tuple from the
cube
CUBESET(connection,set_expression) Returns the set defined by set_expression
CUBERANKEDMEMBER(connection,set_expression,rank) Returns the nth item from a set
CUBEMEMBERPROPERTY(connection,member,property) Returns a property of a cube member
CUBESETCOUNT(set) Returns the number of items in a set
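
For example, formulas such as the following return a member and a single cell value. This is a sketch that assumes a workbook connection named AdventureWorks; the member unique names shown are illustrative and should be adjusted to match the members in your own cube.

=CUBEMEMBER("AdventureWorks","[Measures].[Internet Sales Amount]")
=CUBEVALUE("AdventureWorks","[Measures].[Internet Sales Amount]","[Date].[Calendar Year].&[2004]")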

These functions are used directly in the Excel formula bar and make retrieving values from an
OLAP cube by using custom formulas simpler. Although using these new Excel functions may
suffice for your particular business requirements, in some situations your requirements might
call for more advanced customization. We’ll address just how you can do that in the next
section.

Extending Excel
If your business requirements call for using Excel as a base client for your BI solution, but also
require you to extend Excel, you should take a look at using the Microsoft Visual Studio Tools
for the Microsoft Office System (commonly called VSTO). You can use this to extend Excel to
create custom output styles for PivotChart or PivotTable objects. Another reason to extend
Excel by using code is to overcome the built-in limits, such as 256 page fields for a PivotTable.
For more specifics on Excel data storage limits, see http://office.microsoft.com/en-us/excel/
HP051992911033.aspx.

The development templates for VSTO are included with Visual Studio 2008 Professional,
or the Team Editions of Visual Studio 2008. A free run-time download is available at
http://www.microsoft.com/downloads/details.aspx?FamilyID=54eb3a5a-0e52-40f9-a2d1-
eecd7a092dcb&DisplayLang=en.

If you have one of the full versions of Visual Studio 2008, it contains templates that allow you
to use the various 2007 Office system formats as a basis for application development. In our
current context, you’ll choose Excel 2007 Workbook. The Visual Studio 2008 New Project dia-
log box contains several templates for custom Excel programming. These include templates
for Excel add-ins, workbooks, and templates. Note that the VSTO templates are version-
specific, meaning that different versions of the development templates are available for Excel
2003 and for Excel 2007.

For example, if you select and open the Excel 2007 Workbook template, you are presented
with the familiar Excel workbook inside of Visual Studio 2008. There you can add any of the
common Windows Forms controls, as well as add .NET code to Excel as your business needs
require.

Office applications that you have extended programmatically are called Office Business
Applications, or OBAs. A developer resource center that includes the usual code samples,
training videos, and so on is available on MSDN at http://msdn.microsoft.com/en-us/office/
aa905533.aspx. An Excel 2007 developer portal is also available on MSDN at http://msdn.
microsoft.com/en-us/office/aa905411.aspx. Finally, you will also want to download the Excel
2007 XLL SDK from http://www.microsoft.com/downloads/details.aspx?FamilyId=5272E1D1-
93AB-4BD4-AF18-CB6BB487E1C4&displaylang=en.

To see a sample project created using VSTO, visit the CodePlex site named OLAP PivotTable
Extensions. You can find this project at http://www.codeplex.com/OlapPivotTableExtend. In
this project the developer has created an extension to Excel 2007 that allows authorized end
users to define private calculated members that are specific to their particular PivotTable ses-
sion instance by using a Windows Forms user interface. This extension project also contains a
custom library view for easier management of these newly added calculated members. This
is a good example of an elegant, business-driven extension to Excel's core PivotTable
functionality.

Summary
In this chapter we investigated the integration between SQL Server 2008 SSAS OLAP cubes
and Excel 2007. We looked at the mechanics around PivotTable and PivotChart displays. We
then investigated the process for creating offline OLAP cubes.

We followed this by examining the new Excel functions, which allow direct OLAP cube
data retrieval from the Excel formula bar. We closed our discussion with a look at the setup
required to extend Excel programmatically.

In the next chapter we’ll continue to look at Excel as a SQL Server 2008 BI client, but not for
cubes. In that chapter we’ll look at how Excel is used as a client for SSAS data mining struc-
tures and models.
Chapter 24
Microsoft Office 2007
as a Data Mining Client
In this chapter we take a look at using the 2007 Microsoft Office system as an end-user cli-
ent for SSAS data mining objects. Here we’ll examine the ins and outs of using Microsoft
Office Excel 2007 and Microsoft Office Visio 2007 as end-user clients for SSAS data mining
structure and models. We’ll take a look at the functionality of the Microsoft SQL Server 2008
Data Mining Add-ins for Office 2007. As you may recall from Chapter 2, “Visualizing Business
Intelligence Results,” these add-ins enable both Excel 2007 and Visio 2007 to function as end-
user interfaces to SSAS data mining models. We’ll start by reviewing the installation process
for the Data Mining Add-ins.

Installing Data Mining Add-ins


As we introduced in Chapter 2, several products in the 2007 Office suite are designed to
work as SSAS data mining clients, including Excel 2007 and Visio 2007. For best perfor-
mance, Microsoft recommends installing SP1 for Office 2007 prior to setting up and using
either of these as data mining clients. After you’ve installed the Office service pack, then you
must download, install, and configure the free SQL Server 2008 Data Mining Add-ins for
Office 2007. You can download the add-ins at http://www.microsoft.com/downloads/details.
aspx?FamilyId=896A493A-2502-4795-94AE-E00632BA6DE7&displaylang=en.

After you’ve downloaded and installed the add-ins, you’ll see several new items related
to them under the Microsoft SQL 2008 Data Mining Add-Ins item on your Start menu
(see Figure 24-1), including Data Mining Visio Template, Getting Started, Help And Docu-
mentation, Sample Excel Data, and Server Configuration Utility. Using the add-ins from within
Excel or Visio requires certain configuration information for an SSAS instance. So the best
next step is to open the Server Configuration Utility, which is a four-step wizard that guides
you through the process of configuring a connection from Excel and Visio to your particular
SSAS instance.

Figure 24-1 Data Mining Add-ins for Office 2007


To use the Server Configuration Utility, open it from the Start menu. On the first page of the
wizard, you're asked to supply the name of the SSAS instance and the authentication type
you wish to use when connecting. The default authentication type is Windows Credentials.
On the next page of the wizard, you’re asked whether you wish to allow the creation of tem-
porary (or session) mining models. If you select this option, authorized end users can create
these models by invoking SSAS algorithms and populating models with data from their local
Excel workbooks. On the third page of the wizard, you are asked whether you’d like to create
a new SSAS database (or add the information to an existing database) to hold information
about authorized user names for the add-ins. If you select New, a new SSAS database (named
DMAddinsDB by default) is created on the configured SSAS instance. A single security role
with a name like ExcelAddins_Role_3_19_2008 5_59_27 PM is created, using the current date
and time. Local administrators are added to this role by default. Also by default, this role has
full control on all objects. As with any SSAS role, you can of course adjust role membership
and permissions on this role as necessary. On the last page of the wizard (shown in Figure
24-2), you specify whether you’ll allow users of the Data Mining Add-ins to create permanent
models on the SSAS instance.

Figure 24-2 The last page of the Server Configuration Utility, where you set permanent object creation permissions

Data Mining Integration with Excel 2007


After you install the Data Mining Add-ins, the data mining integration functionality is
exposed in Excel and Visio as additions to the menus inside of each. In Excel, two new tabs
appear on the Ribbon: Table Tools Analyze and Data Mining. To learn more about the add-ins
we’ll open the included sample data file called DMAddins_SampleData.xlsx, which is located
(by default) at C:\Program Files\Microsoft SQL Server 2008 DM Add-Ins.

Note If you worked with the SQL Server 2005 Data Mining Add-ins for Office 2007 and are now
working with SQL Server 2008, you need to download and install the version of the add-ins that
is specific to SQL Server 2008. The 2008 edition of the add-ins has several additions to function-
ality. We review the changes to the add-ins in detail later in this chapter.

After you open Excel 2007, click the Data Mining tab on the Ribbon (Figure 24-3). The
add-ins add this tab permanently to Excel 2007. Notice that the Data Mining tab has seven
major groups: Data Preparation, Data Modeling, Accuracy And Validation, Model Usage,
Management, Connection, and Help. Each group includes one or more large buttons that
give you access to the functionality available in the group. In addition, some buttons include
a tiny downward-pointing triangle that indicates additional functionality available at a click of
that button.

Figure 24-3 Data Mining tab on the Excel 2007 Ribbon

As we continue through this section, we’ll work through the functionality of the majority of
the buttons on the Data Mining tab of the Ribbon. However, we’ll first introduce the other
point of integration that the add-ins add to Excel 2007. To see this, you must select at least
one cell in an Excel table object. You’ll then see Table Tools on the Ribbon. Click the Analyze
tab to see the Table Analysis Tools, Connection, and Help groups, as shown in Figure 24-4.

Figure 24-4 The Table Tools Analyze tab on the Excel 2007 Ribbon

Because these tools expose the simplest set of functionality, we’ll start our detailed look at
the integration between Excel 2007 and SSAS data mining by working with the Table Tools
Analyze tab. We’ll continue our tour by looking in more detail at the functionality of the Data
Mining tab.

Using the Table Analysis Tools Group


Before we dive in and work with the Table Analysis Tools group, let’s take a minute to con-
sider what types of end users these tools have been built for. An effective way to do that is to
look at the help file included for the add-ins in general. Remember that the help for the add-
ins is separate from SQL Server Books Online—you access it via Excel 2007. Here’s an excerpt
from the SQL Server Books Online introduction:

The SQL Server 2008 Data Mining Add-ins for Office 2007 provides wizards and
tools that make it easier to extract meaningful information from data. For the user
who is already experienced with business analytics or data mining, these add-ins
also provide powerful, easy-to-use tools for working with mining models in Analysis
Services.

Glancing through the included help topics, you can see that the intended user is an intermediate
Excel user, particularly someone who uses Excel for analysis tasks. We think of this user as a
business analyst. That is probably not surprising to you. What may be surprising, however, is
the end user we see as a secondary user of Table Analysis Tools and data mining functional-
ity. That end user is you, the technical BI developer. If you are new to BI in general or if you
are new to data mining, using the integration in Excel is a fantastic (and time-efficient) way to
understand the possibilities of data mining.

That said, you may eventually decide that you prefer to work in BIDS. However, we find the
combination of working in Excel and in BIDS to be the most effective. One other considera-
tion is that if you are thinking about creating a custom application that includes data mining
functionality, using the add-ins has two points of merit. First, you can see an example of an
effective data mining client application. Second, you can actually use what already exists in
Excel as a basis for a further customized application, by customizing or extending the included
functionality using .NET Framework programming and the Visual Studio Tools for Office
(VSTO). For more information about VSTO, go to http://msdn.microsoft.com/en-us/office/
aa905533.aspx.

To start using Table Analysis Tools, you must first configure a connection to an SSAS instance.
You might be surprised that you have to do this inside of Excel because you already did a
type of connection configuration during the initial setup. This second configuration allows
you to set permissions more granularly (more restrictively) than you set in the “master” con-
nection when you configured the add-ins setup. Configuring the user-specific setup is simple.
Click the Connection button on the tab and then click New in the dialog box. This opens a
familiar connection dialog box where you list the instance name and connection information,
as shown in Figure 24-5.

Figure 24-5 Connect To Analysis Services dialog box

Now, we’ve told you that you need a connection, but we haven’t yet shown you why. This
should become obvious shortly. At this point, suffice to say that from within Excel you’ll be
connecting to and using the data mining algorithms that are part of SSAS to analyze data
from your local Excel workbook.

Tip As we get started, you’ll notice that Excel has no profiling or tracing capability, so you can’t
natively see what exactly is generated on the SSAS instance. Of course, if you wanted to capture
the generated information, you could turn on SQL Server Profiler, as we’ve described in Chapter
6, “Understanding SSAS in SSMS and SQL Server Profiler.” It is interesting to note that the Data
Mining tab does contain a method of profiling the queries called Trace. You may want to turn it
on as we work through the capabilities of the Table Analysis Tools group.

Figure 24-6 shows the Table Analysis Tools group, which we’ll explore next. The first button in
the group is Analyze Key Influencers. True to the intended end-user audience, the names of
the tools in this group are expressed in nontechnical terms.

Figure 24-6 The Table Analysis Tools group expresses data mining functionality in nontechnical terms.

This language continues as you actually begin to use the functionality. For example, when
you select the Table Analysis Tools Sample worksheet in the sample workbook and click the
Analyze Key Influencers button, a dialog box opens that presents you with a single choice—
which column is to be analyzed—and a link to more advanced functionality (Add/Remove
Considered Columns). The dialog box optionally allows you to continue to analyze the results
by adding a report that shows how the influencers are discriminated (selected). The Analyze
Key Influencers dialog box is shown in Figure 24-7.

Figure 24-7 The Analyze Key Influencers dialog box

The outcome of this dialog box is an easy-to-understand workbook page that shows the key
influencers ranked by level of influence for the selected column and its possible states. In our
example, bike buyer can be either 1 or 0 (yes or no). It’s really that simple. Figure 24-8 shows
the resulting output table in Excel.

Figure 24-8 Using the Analyze Key Influencers button produces a table output in Excel.

It’s important to understand what just happened here. If you used the Tracer tool, you can
see that a temporary data mining model was created and was then populated (or trained)
using the Excel table (spreadsheet) as source data. Closer examination of either Tracer or
SQL Server Profiler output shows that Analyze Key Influencers performed the following steps:

1. Created a temporary mining structure, marking all source columns as discrete or discretized
2. Added a mining model to the structure using the Microsoft Naïve Bayes algorithm
3. Trained the model with the Excel data
4. Retrieved metadata and data from the processed model

The following detailed DMX statements were produced:

CREATE SESSION MINING STRUCTURE [Table2_572384]
    ([__RowIndex] LONG KEY, [ID] Long Discretized,
    [Marital Status] Text Discrete, [Gender] Text Discrete,
    [Income] Long Discretized, [Children] Long Discrete,
    [Education] Text Discrete, [Occupation] Text Discrete,
    [Home Owner] Text Discrete, [Cars] Long Discrete,
    [Commute Distance] Text Discrete, [Region] Text Discrete,
    [Age] Long Discretized, [Purchased Bike] Text Discrete)

ALTER MINING STRUCTURE [Table2_572384] ADD SESSION MINING MODEL
    [Table2_572384_NB_234496]([__RowIndex], [ID], [Marital Status], [Gender],
    [Income], [Children], [Education], [Occupation], [Home Owner], [Cars],
    [Commute Distance], [Region], [Age], [Purchased Bike] PREDICT)
    USING Microsoft_Naive_Bayes(MINIMUM_DEPENDENCY_PROBABILITY=0.001)

INSERT INTO MINING STRUCTURE [Table2_572384] ([ID], [Marital Status],
    [Gender], [Income], [Children], [Education], [Occupation], [Home Owner],
    [Cars], [Commute Distance], [Region], [Age], [Purchased Bike]) @ParamTable

CALL System.GetPredictableAttributes('Table2_572384_NB_234496')

CALL System.GetAttributeValues('Table2_572384_NB_234496', '10000000c')

CALL System.GetAttributeDiscrimination('Table2_572384_NB_234496', '10000000c',
    '', 0, '', 2, 0.0, true)

CALL System.GetAttributeDiscrimination('Table2_572384_NB_234496', '10000000c',
    'No', 1, '', 2, 0.0, true)

CALL System.GetAttributeDiscrimination('Table2_572384_NB_234496', '10000000c',
    'Yes', 1, '', 2, 0.0, true)

SELECT FLATTENED(SELECT [SUPPORT] FROM NODE_DISTRIBUTION
    WHERE ATTRIBUTE_NAME='Purchased Bike' AND VALUETYPE=1)
    FROM [Table2_572384_NB_234496].CONTENT WHERE NODE_TYPE=26

CALL System.GetAttributeValues('Table2_572384_NB_234496', '10000000c')

Now that you understand what is happening, you should work your way through the rest
of the buttons on the Table Tools Analyze tab. You’ll see that the Detect Categories button
uses the Microsoft Clustering algorithm to create a temporary mining model that results in
a workbook page that groups your input data (from one or more columns) into clusters or
categories and shows the results both statistically (through a row count of category values)
and via a stacked bar graph. You can use the Detect Categories button to quickly group data

from an Excel workbook source into buckets, or groups. Using this functionality can help you
to understand your data and correlations between attributes in your data more quickly. For
example, you can create groups (or ranges) of income values in a source dataset that includes
a large number of other attributes, such as age, number of children, home ownership, and
so on.
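If you capture a Detect Categories run in Tracer, the pattern mirrors the Naïve Bayes trace shown earlier, with a clustering model in its place. The following is a simplified sketch under that assumption—the structure and model names, and the shortened column list, are placeholders rather than the generated names you will actually see:

CREATE SESSION MINING STRUCTURE [DetectCategories_Sketch]
    ([__RowIndex] LONG KEY, [Income] Long Continuous,
    [Age] Long Continuous, [Commute Distance] Text Discrete)

ALTER MINING STRUCTURE [DetectCategories_Sketch] ADD SESSION MINING MODEL
    [DetectCategories_Sketch_CL]([__RowIndex], [Income], [Age], [Commute Distance])
    USING Microsoft_Clustering(CLUSTER_COUNT = 0)  -- 0 lets the algorithm choose the number of categories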

The Fill From Example button asks you to select an example column and then produces an output
workbook page that shows the suggested values that you can use for extending your series.
You can use this functionality to quickly fill in a series on a source workbook. Such series can
include time, financial results, age, and so on. A look at the output in the Tracer tool shows
that this functionality uses the Microsoft Logistic Regression algorithm.
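Under the covers the pattern is the same as before: a session model is built over your table with the example column flagged PREDICT, and a prediction query then fills in the blank cells. A minimal sketch, assuming an example column named [High Value Customer] (all names here are illustrative, not the generated ones):

CREATE SESSION MINING MODEL [FillFromExample_Sketch]
    ([__RowIndex] LONG KEY, [Income] Long Continuous,
    [Occupation] Text Discrete,
    [High Value Customer] Text Discrete PREDICT)  -- the example column you selected
USING Microsoft_Logistic_Regression

SELECT T.[__RowIndex], Predict([High Value Customer])
FROM [FillFromExample_Sketch] NATURAL PREDICTION JOIN @ParamTable AS T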

The Forecast button requires a bit more configuration in its associated dialog box, shown in
Figure 24-9. To use it you select the column (or columns) that you wish to forecast. Only col-
umns with source data of types that can be used as input to the associated algorithm appear
available from the source workbook in this dialog box. The next step is to confirm or adjust
the number of time units that you wish to forecast. The default value is five units. In the
Options section, you can set the source columns for the time stamp. Finally, you can set the
periodicity of the data.

Note Your screen may say Income rather than Yearly Income, depending on which version of
the Excel sample workbook for data mining (called DMAddins_SampleData.xlsx) you are using.

Figure 24-9 Using the Forecast button creates a mining model using the Microsoft Time Series algorithm.

Did you guess which algorithm was used here? It’s pretty obvious, isn’t it? It’s the Microsoft
Time Series. You may have noticed that our sample data does not contain any temporal
information. We did this on purpose to illustrate a point. Although the Table Analysis Tools
may be straightforward to use, you must still observe some common-sense practices. You
still must select the correct tool for the job at hand. Although running Forecast against data with
no time column produces a result, that result may not be meaningful. Also, the overhead to produce this result can be pretty steep,
because the algorithm has to “fake” a time column. Of course, fake data is rarely meaningful,
so although the algorithm runs even if you use source data without time values, we advise
against doing this. If you are wondering how this is done, again, use the Tracer tool. We did,
and here’s the key line:

CREATE SESSION MINING STRUCTURE [Table2_500896] ([Income] Long CONTINUOUS, [Age] Long
    CONTINUOUS, [__RowIndex] LONG KEY TIME)

Note the addition of the __RowIndex column with type LONG KEY TIME. Although under-
standing data mining concepts is not really needed to use the Table Analysis Tools, this idea
of including a time-based source column is important when using the Forecasting functional-
ity. As mentioned earlier, it is important to include data with an included time-series value
when you run the Forecast tool from the Table Analysis Tools.
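For comparison, here is a hedged sketch of what such a structure might look like when the source data does include a real time column—the [Month] and amount columns are invented for illustration:

CREATE SESSION MINING STRUCTURE [MonthlySales_Sketch]
    ([Month] Date KEY TIME,
    [Europe Amount] Double Continuous,
    [North America Amount] Double Continuous)

ALTER MINING STRUCTURE [MonthlySales_Sketch] ADD SESSION MINING MODEL
    [MonthlySales_Sketch_TS]([Month], [Europe Amount] PREDICT, [North America Amount] PREDICT)
    USING Microsoft_Time_Series(PERIODICITY_HINT = '{12}')  -- monthly data with a yearly cycle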

The next button, Highlight Exceptions, just asks you to select one or more of the source col-
umns for examination. This button, like those reviewed so far, creates a mining structure and
model. Of course, it also runs a DMX query. Notice that the DMX Predict, PredictVariance, and
PredictCaseLikelihood functions are used here. These functions generate a result (a new work-
book in Excel) that allows you to quickly see the exception cases for your selected data. These
exceptions are sometimes called outliers. Understanding this can help you to judge the qual-
ity of the source data—in other words, the greater the quantity of outliers, the poorer the
quality of data. The particular DMX prediction function invoked when you use the Highlight
Exceptions functionality depends on the source data type selected for examination.

The text of that generated query looks like this:

SELECT T.[Income], Predict([Income]), PredictVariance([Income]),
    PredictCaseLikelihood()
FROM [Table2_372960_CL_463248] NATURAL PREDICTION JOIN @ParamTable AS T

ParamTable = Microsoft.SqlServer.DataMining.Office.Excel.ExcelDataReader

Note The input is read from Excel using an ExcelDataReader object. If you were to write a cus-
tom data mining client, this is the object and the library that you would work with to do so.

The next button, Scenario Analysis, allows you to perform goal-seeking or what-if scenarios
on one column and one or more rows of data. A goal-seeking scenario generates a DMX
query that uses the PredictStdDev function. A what-if scenario generates a DMX query that
uses the PredictProbability function. The output from both queries is shown at the bottom of

the configuration dialog box. The output is assigned a confidence score as well. Figure 24-10
shows the output from using the What-If Scenario Analysis. In this example we’ve configured
the scenario change (input) column to use Income. We’ve asked to analyze the impact on the
target column named Commute Distance. For expediency, we’ve asked to perform analysis
on a single row from the source table. You can see that the output of the query predicts that
an increase in yearly income correlates with a commute distance of 0-1 miles. The confidence
of the result is ranked from poor to very good as well as color-coded and bar-graphed to
show the strength of that resulting confidence.

Figure 24-10 The What-If Scenario Analysis output includes a confidence ranking.
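If you capture the what-if run in Tracer, you will see a singleton prediction query along these lines. This is a simplified sketch: the session model name is a placeholder, the literal input values stand in for the row that the add-in actually passes through @ParamTable, and a goal-seeking run calls PredictStdDev instead:

SELECT Predict([Commute Distance]),
    PredictProbability([Commute Distance])
FROM [Table2_WhatIf_Sketch]
NATURAL PREDICTION JOIN
    (SELECT 120000 AS [Income], 45 AS [Age]) AS T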

Because of the popularity of the Table Analysis Tools group, Microsoft has added two new
buttons, Prediction Calculator and Shopping Basket Analysis, to the Table Tools Analyze tab
in the SQL Server 2008 Data Mining Add-ins. The Prediction Calculator requires a bit of con-
figuration before it can be used, as shown in Figure 24-11. The first value you must configure
is Target, which represents the column and either an exact value or a range of values from
that column’s data for which you want to detect prediction patterns. Because this column
can contain multiple values and you want to restrict the values being predicted for, you can
use the Exactly or In Range options. Of course, you can only select values in range for input
column values that could be considered continuous, such as income. Two optional values,
Operational Calculator and Printer-Ready Calculator, are selected by default. These values are
produced in addition to the output so that a user can input values and see the variance (or

cost) of the changes to the model. For our example, we set the Target column to Education
and the value to Bachelors.

Figure 24-11 The Prediction Calculator includes an Operational Calculator.

This tool uses the Microsoft Logistic Regression algorithm, and marks the Target value with
the Predict attribute. The output produced is a dynamic worksheet that allows you to tinker
with the positive or negative cost or profit. Literally, this helps you to understand the
potential profit (or cost) from making a correct (or incorrect) prediction.

The worksheet consists of four parts. The first section is a small calculation section where you
can adjust the values associated with positive or negative cost or profit. The other three sec-
tions show the point of reaching profitability based on the input calculations. As you adjust
the costs associated with the targeted value upward or downward, you can immediately see
the projected profit score threshold and cumulative misclassification values change on the
linked charts. Because we left the default values of Operational Calculator and Printer-Ready
Calculator selected as well, we also get two additional workbook pages that can be used
interactively or printed and used by end users who wish to fill out (and score) the results
manually. These calculators assign scores (or weights) to each of the attribute values. Adding
the scores for selected attributes produces the likelihood results for the target attribute. The
results of this report are shown in Figure 24-12.

Figure 24-12 The Prediction Calculator allows you to test different values associated with costs and profits.

The Shopping Basket Analysis button uses the Microsoft Association algorithm and produces
several reports that can help you to understand which items sell best together. It also sug-
gests which items sold together would increase your profits the most. To use this button,
you’ll want to use data from the sample worksheet on the Associate tab. This is because this
data is in a format expected by this tool, meaning that it contains an order number (key),
category (optional), product name, and price. You can see from the configuration dialog
box that these values are called Transaction ID, Item, and Item Value. Although Item Value is
optional, the tool produces much more meaningful results if this information is available.

The Advanced (Configuration) link lets you adjust the default values for Minimum Support
(set to 10) and for Minimum Rule Probability (set to 40). The former specifies the minimum
number of transactions used to create a rule; the latter specifies the strength of correlation
required to create rules. As you may remember from Chapter 13, “Implementing Data
Mining Structures,” and Chapter 14, “Architectural Components of Microsoft SQL Server 2008
Integration Services,” the Microsoft Association algorithm creates rules and then shows item-
sets, or groups of items that generate increasingly greater total return values. You can view
the Shopping Basket Analysis tool in Figure 24-13.

Figure 24-13 The Shopping Basket Analysis tool uses the Microsoft Association algorithm.
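Behind the scenes, the flat worksheet data is mapped to a case table with a nested transaction table before the Microsoft Association algorithm processes it. The sketch below shows the general shape of such a model definition—the names are placeholders, and the parameter values simply echo the defaults described above:

CREATE SESSION MINING MODEL [ShoppingBasket_Sketch]
    ([Order Number] Text KEY,
    [Products] TABLE PREDICT
        ([Product] Text KEY))
USING Microsoft_Association_Rules(MINIMUM_SUPPORT = 10, MINIMUM_PROBABILITY = 0.40)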

The output from this tool is two new workbook pages, the Shopping Basket Bundled Items
report and the Shopping Basket Recommendation report. The first report shows you bundled
items (minimum of two items per bundle), number of items in the bundle, number of sales
of this bundle, average value per bundle, and overall sales total of bundled items, as you can
see in Figure 24-14. This report is sorted by overall sales total, but you can, of course, re-sort
it to your liking. The second report shows you which items when bundled together result in
increased sales and suggests items to cross-sell.

Figure 24-14 The Shopping Basket Bundled Items report shows which items have sold together.

The first column lists a particular item; this is followed by the top recommendation for cross-
selling with that particular item. The table also shows counts, percentages, and estimated sale
values of cross-selling the listed items. As with the Shopping Basket Bundled Items report,
this report is also sorted by overall value (or potential profit) dollars by default.

As we complete our tour of the Table Tools Analyze tab, you should remember a couple of
points. This tab’s primary purpose is to expose the functionality of some of the SSAS data
mining algorithms to users who wish to use Excel table data as source data. Note that we did
not access, create, update, or in any other way work with permanent data mining models on
the SSAS server. For tasks such as those, we'll look next at the other data mining tool available
in Excel: the Data Mining tab of the Ribbon.

Using the Data Mining Tab in Excel 2007


Before we start working with the Data Mining tab, let’s take a minute to understand the
conceptual differences between the tools and functionality it exposes as compared to those
available on the Table Tools Analyze tab of the Ribbon. Refer again to Figure 24-3, the Data
Mining tab, as we discuss its design. Take a look at the tab’s groups: Data Preparation, Data
Modeling, Accuracy And Validation, Model Usage, Management, Connection, and Help. It
is interesting to note that these group names roughly correspond to the CRISP-DM SDLC
phases that we discussed in Chapter 13. These names might seem odd because Excel is typi-
cally an end-user client. In the case of data mining, Excel is also a lightweight development
tool. This type of implementation is quite new in Microsoft’s product suite—using an Office
product as an administrative or a developer interface to a server-based product has rarely
happened.

It is quite important that you understand that the Data Mining tab (interface) is designed for
these two types of end users: business analysts and BI administrators. Of course, in smaller
organizations it is quite possible that these functions could be performed by the same
person—in fact, that is something that we’ve commonly seen with clients who are new to
data mining. Also, as we've mentioned, for application developers who are completely new
to data mining, using the Data Mining tools, rather than BIDS itself, to develop,
query, and manage data mining models on your SSAS instance may prove to be more pro-
ductive for you.

Unlike the Table Analysis Tools group, the Data Mining tab functionality can generally (but
not always) interact with data that is stored either in the local Excel workbook or on the
SSAS server instance. The two tab interfaces have some commonality in the access to the
Data Mining Add-in–specific help file and access to the required Connection configuration
object. As with the Table Analysis Tools group, using any of the tools exposed on the Data
Mining tab requires an active connection to an SSAS instance because mining algorithms are
being used to process the data in the request. We mentioned the use of the included Tracer
tool when we discussed the Table Analysis Tools group. Tracer captures and displays activity

generated on the SSAS instance from the use of any of the Data Mining tools in Excel. Figure
24-15 shows some sample output from the Tracer tool. As we did when examining the Table
Analysis Tools group, we’ll use Tracer when working with the tools on the Data Mining tab so
that we can better understand what the tools are actually doing.

Figure 24-15 The Tracer tool shows generated activity on the SSAS server from the Data Mining Add-ins in Excel.

We’ll now take a look at the remaining groups and contained functionality of the tab—
those that are unique to this particular tab: Data Preparation, Data Modeling, Accuracy And
Validation, Model Usage, and Management. We'll start by taking a look at some of the
administrative capabilities built into the Data Mining tab. These capabilities are found in the
Management and Model Usage groups on the tab.

Management and Model Usage


The Management group of the tab has one button, Manage Models. This button allows
authorized Excel users to perform a number of functions on existing models that are stored
in the configured SSAS instance. After you click this button, a dialog box opens (Figure 24-16)
that lists all mining structures and their contained models on the left side. On the right side,
you’ll see a list of actions that can be performed on a selected model or structure. Possible
actions for structures include renaming, deleting, clearing data, processing, exporting exist-
ing metadata, and importing new metadata. Possible actions for models include renaming,
deleting, clearing data, processing, and exporting or importing existing metadata. If you
choose to export metadata, you must specify the export file destination location. The type of
output file produced is a native application format, meaning that it is not XMLA. Instead it is
an SSAS backup file that can be restored to an SSAS instance using SQL Server Management
Studio.

Figure 24-16 The Manage Models tool exposes administrative functionality through Excel for SSAS data mining objects.

On the bottom right of this dialog box, metadata about the selected structure or object is
shown. You may be wondering whether the Data Mining tab buttons are security-trimmed.
Security trimming is a feature that removes menu items or tools when the currently active
user does not have permission to perform the actions associated with them. In the case
of the Data Mining tab, the tools are not security-trimmed, which means that end users in
Excel can see all the tools, but may be connecting to an SSAS instance using credentials that
are authorized to perform a limited subset of available options. If a user tries to perform an
action for which she is not authorized, that action fails and a dialog box alerts the user that
the action has failed.

The next button, Document Model, appears in the Model Usage group and is newly
added for SSAS 2008. When you click this button, the Document Model Wizard
opens. The first page of the wizard describes what the tool does. On the second page of the
wizard, you are presented with a list of all mining structures and models on the SSAS server
instance. You click the mining model that you are interested in documenting, and on the next
page of the wizard you choose the amount of detail you’d like to see in the documentation
(complete or summary). The wizard produces a new workbook page with the selected mining
model’s characteristics documented. Figure 24-17 shows a partial example output using the
Customer Clusters model that is part of the Adventure Works DW 2008 sample. Notice that it
contains the following information: Model information, Mining Model column metadata, and
Algorithm parameter configuration values. You may want this information for backup and
restore (general-maintenance) purposes.

Figure 24-17 The new Document Model tool provides a quick and easy way to document model settings.

The Browse button, which is closely related in functionality to the Document Model button,
is located in the Model Usage group of the tab. As with the tools we've looked at so far,
when you click this button you are first presented with a list of mining structures and models
located on the SSAS instance. You then select the model of interest and Excel presents you
with the same model viewers that we already saw in BIDS. Remember from our earlier discus-
sion of those mining model viewers that each of the nine included data mining algorithms is
associated with one or more viewers.

It is interesting to note that Excel’s Browse capability has a useful addition to some of the
viewers. At the bottom left of the dialog box that displays the viewer (for most viewers), you’ll
see a Copy To Excel button. The results of clicking this button vary depending on what type
of viewer you are looking at. In some cases the interactive viewer is rendered as a graphic in
a new worksheet. In other cases the information from the viewer is loaded into a new work-
sheet as Excel data and auto-formatted. Figure 24-18 shows a portion of example output of
the latter case. We used the Customer Clusters sample mining model and copied the Cluster
Profiles viewer data to Excel.

We’ve found integration capabilities such as this one to be very useful and powerful.
Remember that when the viewer’s Copy To Excel action dumps the source data into a new
workbook, the end user can manipulate that data using any of Excel’s functionality, such as
sort, filter, format, and so on.

Figure 24-18 The Copy To Excel feature used with Cluster characteristics produces a new workbook page.

The last button in the Model Usage group of the Data Mining tab is the Query button. This
button gives authorized end users the ability to build DMX prediction queries by using a
guided wizard. When you click the Query button, the wizard presents you with a series of
pages. These pages include tool explanation, model selection, source data (which can be
from the open Excel workbook or from any data source that has been configured on SSAS),
model and input column mapping, output selection, and query result destination. This tool
functions similarly to the Mining Model Prediction tab in BIDS in that it is designed to help
you write and execute DMX prediction queries. The key difference in the version available on
the Data Mining tab of the Ribbon in Excel is that you can define input tables using data from
an Excel workbook as an alternative to using SSAS source data for that purpose.

An advanced query builder is also available from within the wizard. You access this query
builder by clicking the Advanced button on the lower left of the column or output mapping
page. The Data Mining Advanced Query Editor is shown in Figure 24-19. Note that in addi-
tion to being able to directly edit the query, this tool also includes DMX templates, as well as
quick access to the other wizard input pages, such as Choose Model.

You can click parameter values, which are indicated by angle brackets (such as <Add Output>
in Figure 24-19) and Excel opens a dialog box that allows you to quickly configure the needed
value. For example, if you click <Add Output> (shown in Figure 24-19), the Add Output dia-
log box opens, where you can quickly complete those values. After you click OK, you are
returned to the Data Mining Advanced Query Editor to continue editing the entire query.

Figure 24-19 The Data Mining Advanced Query Editor
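To give you a feel for the kind of statement you end up with, here is a hedged example of a prediction query written against the TM Decision Trees model from the Adventure Works DW 2008 samples. The data source name and the column mappings are abbreviated and may not match your deployment exactly:

SELECT T.[LastName],
    Predict([Bike Buyer]) AS [Predicted Buyer],
    PredictProbability([Bike Buyer]) AS [Buyer Probability]
FROM [TM Decision Trees]
PREDICTION JOIN
    OPENQUERY([Adventure Works DW],
        'SELECT LastName, MaritalStatus, Gender, YearlyIncome, NumberCarsOwned
         FROM dbo.ProspectiveBuyer') AS T
ON [TM Decision Trees].[Marital Status] = T.[MaritalStatus]
AND [TM Decision Trees].[Gender] = T.[Gender]
AND [TM Decision Trees].[Yearly Income] = T.[YearlyIncome]
AND [TM Decision Trees].[Number Cars Owned] = T.[NumberCarsOwned]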

The tools that we’ve covered so far are quite powerful and are designed for advanced ana-
lysts as well as BI administrators and developers. These tools are designed primarily to allow
Excel users to work with models on the SSAS server, although there are exceptions. The next
group of tools we’ll review has a different focus in that it is primarily designed to facilitate
quick evaluation of Excel data using SSAS data mining algorithms.

Data Preparation Group


The first group on the Data Mining tab is called Data Preparation. It includes the Explore
Data, Clean Data, and Sample Data buttons. Generally these three buttons are designed to
expose data mining functionality to locally stored data (meaning in the Excel workbook).
You can think of the functionality as a kind of SSIS-light look at Excel data. Clicking any
of these buttons opens a descriptive wizard page. The second page of the wizard allows you
to select the data you’d like to work with. For exploring and cleaning data you are limited to
working with Excel data. Your selection options are Table or Range. For Sample Data you may
select Excel data or data from a data source configured in SSAS.

These tools allow a quick review, simple cleaning, or sampling of source data. To use Explore
Data, simply click the button to start the wizard. After you select your data source, you select
a single column for analysis. Sample output based on the Education column in the Table
Analysis Tools Sample Table 2 is shown in Figure 24-20. This tool counts discrete data values
in the selected columns and graphs the output. A bar graph is the default output. You can
toggle between charting the discrete values and looking at continuous numeric data by

clicking the two small chart buttons on the bottom left of the output page. The column you
are exploring must be numeric if you want to use the numeric view.

Figure 24-20 The Explore Data Wizard automatically "buckets" your data using a clustering algorithm.

The next button, Clean Data, includes two functions, Outliers and Re-label. Click Outliers,
click Next on the tool description page, and then select the data you wish to clean. You then
select the column you wish to analyze, and Excel presents you with output that shows the
values from the selected column graphed. You can adjust the expected minimum or maxi-
mum values with the slider controls on the Specify Thresholds page. Values that fall outside
of this range are considered outliers. Figure 24-21 shows a sample from the Source Data
worksheet, Source Data table, and Age column. As with the Explore Data output page, you
can toggle the graph type between discrete and numeric displays by clicking the small but-
ton on the bottom left corner of the Specify Thresholds page.

After you define the outliers and click Next, on the next page of the wizard you specify what
you would like done with the values that you’ve indicated are outliers. You have four choices:

■■ Change Value To Specified Limits (the default setting)


■■ Change Value To Mean
■■ Change Value To Null
■■ Delete Rows Containing Outliers

Figure 24-21 The Outliers Wizard allows you to set the acceptable range for values.

After you specify your preference for outlier handling, on the last page you specify where
you’d like to put the modified data. You can add it as a new column in the current workbook,
place it in a new workbook, or change the data in place.

The Re-label function allows you to specify alternate values for existing column values. The
wizard presents a similar series of pages, including one where you list the new labels and one
where you specify where the new output should be placed. This is a quick way to update
column values to make them consistent.

The last button, Sample Data, allows you to select either source data from Excel or from
a data source defined on your SSAS instance and then create a sample set of data from it.
Remember that a common technique in data mining is to use one set of data for training and
another set of data for validation. Remember also that a new feature of SSAS 2008 is to spec-
ify partition values during model creation (supported for most algorithms) using BIDS, so we
see this function in Excel being used mostly on Excel source data. After selecting your data
source, on the next page of the wizard you are asked to specify the sampling method. You
have two choices: random sampling (which is the default) or oversampling (which allows you
to specify a particular data distribution). For example, you can use oversampling to ensure
that your sample data includes equal numbers of car owners and non–car owners, even if
the source data does not reflect that distribution. If you choose random sampling, on the
next page of the wizard you specify the size of the sample by percentage or row count. The
default is 70%. On the last page of the wizard, you specify the output location.

If you choose oversampling as the sampling method, on the next page of the wizard you spec-
ify the input column, target state, target percentage, and sample size, as shown in Figure 24-22.

Figure 24-22 The Sample Data Wizard

After you have reviewed your source data, cleaned it, and created samples from it, you may
want to use one or more data mining algorithms to help you better understand that data.
The next section of the tab, Data Modeling, allows you to do just that.

Data Modeling Group


It is important to understand that a global configuration setting controls an important
behavior of all of the tools available in the Data Modeling group on the Data Mining tab.
That setting concerns mining model creation. By default, when you set up the initial configu-
ration and connection to the SSAS server using the Data Mining Add-ins Server Configuration
Utility, you can specify whether you want to allow the creation of temporary mining models.
If you’ve left this default setting selected, using the mining model tools from this group on
the tab creates temporary (or session) mining models. These models are available only to
the user who created them and only during that user’s session. If you disable the creation
of temporary mining models, using any of the tools in this group creates permanent mining
models on your SSAS instance. The Data Modeling group on the Data Mining tab is shown in
Figure 24-23.

Figure 24-23 The Data Modeling group on the Data Mining tab in Excel
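In DMX terms, this setting simply controls whether the statements the add-in issues include the SESSION keyword. A minimal sketch of the difference, using an abbreviated and hypothetical column list:

-- Temporary: visible only to the creating session and discarded when it ends
CREATE SESSION MINING MODEL [ClassifyBikeBuyer_Temp]
    ([ID] Long KEY, [Age] Long Continuous, [Bike Buyer] Long Discrete PREDICT)
USING Microsoft_Decision_Trees

-- Permanent: persisted on the SSAS instance for other users and tools
CREATE MINING MODEL [ClassifyBikeBuyer]
    ([ID] Long KEY, [Age] Long Continuous, [Bike Buyer] Long Discrete PREDICT)
USING Microsoft_Decision_Trees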

Needless to say, appropriate configuration of the model creation location is quite impor-
tant. Consider carefully which situation better suits your business needs when planning your
implementation of data mining and using Excel 2007 as a client.

All of the tools in this group function similarly in that they create a data mining model (either
temporary or permanent) using source data and values you specify as you work through the
wizard associated with each tool. The buttons map to the underlying algorithms like so:

■■ Classify uses Microsoft Decision Trees.


■■ Estimate uses Microsoft Decision Trees in regression mode (auto-detects a regressor).
■■ Cluster uses Microsoft Clustering.
■■ Associate uses Microsoft Association.
■■ Forecast uses Microsoft Time Series.
■■ Advanced allows you to select any algorithm using the name it has in BIDS.

Because we’ve spent so much time in previous chapters reviewing the algorithms in detail,
we won’t repeat the process here. We’ll just take you through one example so that you can
get a feel for how mining model creation works in Excel. We’ll use the workbook page named
Associate from the Excel sample data. As we work through the wizard, notice that required
values are presented on the Association page, as shown in Figure 24-24.

Figure 24-24 The Association page exposes required configurable property values such as Transaction ID.

You may recall a similar tool on the Table Tools Analyze tab called Shopping Basket Analysis.
If you look closely at the wizard page in Figure 24-24 and compare it with the one associated
with the Shopping Basket Analysis tool (shown previously in Figure 24-13), you can see that
the only difference is that some of the advanced parameter values, such as thresholds for
support, are shown on this page.

As with the Shopping Basket Analysis tool, to use the Associate tool you need to select a
column to associate with the transaction ID and also one for the item. It is interesting also to
note that one value that was configurable on the Shopping Basket Analysis page (item price)
is missing here. You may remember that the Shopping Basket Analysis tool is new for SQL
Server 2008. Its configurable parameters probably represent the most common customer
requests.

Although you could build your model after configuring only the parameters displayed on this
page, you have access to additional parameters when using the tools on the Data Mining tab.
If you click the Parameters button on the bottom left of the Association page of the wizard,
an Algorithm Parameters dialog box opens, as shown in Figure 24-25. As we’ve discussed,
these advanced parameters vary widely by source algorithm. Although some parameters
are fairly self-explanatory, such as MAXIMUM_ITEMSET_COUNT in the following example,
we consult SQL Server Books Online when we work with less common manually configured
parameters.

Figure 24-25 The Algorithm Parameters dialog box presents advanced parameters that are specific to each algorithm.

After you complete your configuration of all exposed parameters, the Associate Wizard pre-
sents you with a page containing metadata about the mining structure that you are about to
create. A sample output is shown in Figure 24-26. You can see that in addition to suggested
structure and model names and descriptions, you have a couple of options to configure on
this last page.

Figure 24-26 The final page of the Associate Wizard presents details of what will be created for the mining structures.

The default—Browse Model—is shown in Figure 24-26. You may also choose to create a tem-
porary model, rather than a permanent one, and whether to enable drillthrough. Remember
from our previous discussion on drillthrough that not all algorithms support drillthrough.
Remember also that a new feature of SQL Server 2008 is to allow drillthrough to all columns
included in the mining structure, even if the columns of interest are not part of the mining
model.
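If you do enable drillthrough, you can later pull the underlying cases back out with a DMX query; the StructureColumn function, new in SQL Server 2008, is what reaches columns that live in the structure but not in the model. A hedged sketch with placeholder model and column names:

SELECT [Order Number],
    StructureColumn('Customer Email') AS [Customer Email]  -- in the structure, not in the model
FROM [MyAssociationModel].CASES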

If you leave the Browse Model check box selected on the Finish page, Excel displays your
model using the associated viewers for the particular algorithm that you used to build it. As
we discussed when reviewing the functionality of the Browse button on the Ribbon, what is
unique about the display of the viewers in Excel, rather than BIDS, is that the viewer includes
the ability to copy to Excel. The copy is executed in one of two ways. (The method used var-
ies by algorithm viewer.) The first method is to embed an image file of the viewer in a new
workbook page. The second method is to dump the data (and often to apply Excel’s auto-
formatting to that data) into a new workbook page.

Now that you have an understanding of what you can do using the Data Modeling group
on the Data Mining tab, you might be wondering in what business situations you should use
BIDS to create models rather than Excel. The answer is practical: Use the tool that seems the
most natural to you. You also want to consider which users, if any, you’ll grant model creation
permission to. If you do so, you also need to consider whether you’ll allow these users to
create only temporary (session) models, only server-based models, or some combination of
the two.

We generally advocate involving power data miners (usually business analysts) as Excel users
of SSAS data mining because most of them are already Excel power users. They can quickly
and easily become productive with a familiar tool. We have also had some success getting
technical professionals (whether they are DBAs or developers) quickly up to speed with data
mining capabilities by using the Excel add-ins rather than BIDS.

As with model building in BIDS, after you create a model in Excel, a best practice is to vali-
date it. To that end, we’ll next look at the group on the Data Mining tab of the Excel Ribbon
where those tools are found—Accuracy And Validation.

The Accuracy And Validation Group


As with the previous group, in the Accuracy And Validation group you find tools that
expose functionality that is familiar to you already because you’ve seen it in BIDS. The
Mining Accuracy tab in BIDS contains nearly identical functionality to the Accuracy Chart,
Classification Matrix, Profit Chart, and (new for SQL Server 2008) the Cross-Validation tools.

As in the previous section, we’ll take a look at just one tool to give you an idea of using this
functionality in Excel rather than in BIDS. We’ll use the Cross-Validation tool for our discus-
sion. When you click Cross-Validation, a wizard opens where you select your model of inter-
est (remembering that some algorithms are not supported for cross-validation). We’ll use the
Targeted Mailing structure from the Adventure Works DW 2008 sample and then select the
TM Decision Trees model from it. On the Specify Cross-Validation Parameters page, shown in
Figure 24-27, we’ll leave Fold Count set to the default of 10. We’ll also leave Maximum Rows
set to the default of 0. We’ll change Target Attribute to Bike Buyer and we’ll set Target State
to 1.
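Behind this dialog box, the add-in calls the same cross-validation system stored procedure that BIDS uses. A hedged sketch of that call with the settings just described—we are reproducing the parameter order from memory, so confirm the exact signature in SQL Server Books Online before relying on it:

CALL SystemGetCrossValidationResults(
    [Targeted Mailing],     -- mining structure
    [TM Decision Trees],    -- model(s) to validate
    10,                     -- fold count
    0,                      -- max cases (0 = use all)
    'Bike Buyer',           -- target attribute
    '1',                    -- target state
    0.5)                    -- target threshold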

We find that Cross-Validation is quite resource-intensive. As we mentioned in Chapter 13, it
is not the type of validation that you will choose to use for all of your models. Instead, you
may want to use it when you haven’t partitioned any source data. Cross-Validation, of course,
doesn’t require a test dataset because it creates multiple training sets dynamically during the
validation process. Cross-Validation produces a report in a new workbook page that contains
similar information to that produced when using Cross-Validation in BIDS. A portion of the
output produced is shown in Figure 24-28.

We are nearly, but not yet, done with our exploration of data mining integration with Office
2007. One additional point of integration remains to review—integration between SSAS data
mining and Visio 2007. This functionality is included as part of the SQL Server 2008 Data
Mining Add-ins for Office 2007.

Figure 24-27 The Specify Cross-Validation Parameters page

Figure 24-28 The Cross-Validation output in Excel is similar to that produced in BIDS.

Data Mining Integration in Visio 2007


To understand the integration between SSAS 2008 data mining and Visio 2007, click Data
Mining Visio Template under the SQL Server 2008 Data Mining Add-ins option on the Start
menu. This opens the template that is designed to allow you to create custom visualizations
of the results of the three included data mining algorithm views. Note that the add-ins add
this template and new menu items to Visio 2007. The options available on the new Data
Mining menu are Manage Connections, Insert Decision Tree, Insert Dependency Net, Insert
Cluster, Trace, and Help. You can click the Trace item to open the Tracer tool (which is identi-
cal to the one available in Excel) so that you can see the query text that Visio generates and
sends to SSAS when you use any of the integration capabilities.

To start working with Visio’s integration, click Manage Connections on the Data Mining menu
and configure a connection to the Adventure Works DW 2008 sample. The next step is to
either select one or more mining model views from the menu for insertion on your working
diagram, or drag one or more shapes from the Microsoft Data Mining Shapes stencil (which
contains the same three model views). After you perform either of these actions, a wizard
opens. The first page of the wizard describes the purpose of the particular data mining
algorithm view in nontechnical terms. The second page of the wizard asks you to select the
connection to SSAS to use to retrieve the model information. The third page of the wizard,
(shown in Figure 24-29) lists the available mining structures and models on the SSAS instance
to which you’ve connected. Available means data mining models that were constructed using
algorithms that use the particular view type that you’ve selected. In our example, this is the
Dependency Network model.

Figure 24-29 The Visio data mining integration includes wizards to help you select the appropriate source models.

The next page of the wizard allows you to configure specific items for the particular view
type selected. In our example using the Dependency Network view, this next page asks you
to specify the number of nodes fetched (default is 5) and (optionally) to filter displayed nodes
by using a Name Contains query. You'll also see the Advanced button, which lets you format the
output for each displayed node. These options are shown in Figure 24-30.

Figure 24-30 The Dependency Net Options allow you to format the node output.

The last page of the wizard lists the tasks that will be completed, such as fetching and
formatting the information, and shows you a status value as each step is
performed. The output is then displayed on the Visio workspace. In addition to using the
template or menu to add additional items, you can, of course, use any of Visio’s other nota-
tion capabilities to further document the output. You can also use the Data Mining toolbar.
Figure 24-31 shows sample output. An interesting option available on the Data Mining
toolbar is Add Items. To use this option, select any node on the diagram and then click the
Add Items button. A dialog box opens that queries the mining model metadata and allows
you to select additional related nodes for display on the working diagram. Particular to the
Dependency Network is the Strength Of Association slider that we’ve seen displayed to the
left in the BIDS viewer. This slider functions in the same way, allowing you to add or remove
nodes based on strength of association. However, the slider is displayed to the right of the
working area in Visio. One limit to the data mining visualizations in Visio is that you can
include only one visualization per Visio page.

Figure 24-31 The Dependency Net view includes the Strength Of Association slider.

Next we’ll take a look at specifics associated with the Decision Tree view. To do that, create a
new, blank page in Visio and then drag the Decision Tree shape from the stencil to that page.
A wizard opens where you again confirm your connection. On the next page of the wizard
you can choose the mining model you want to use. Next, you are presented with formatting
choices that are specific to this algorithm view, as shown in Figure 24-32. Here you select the
particular tree from the model. (Remember that Decision Trees models can house more than
one tree in their output.) Next you select the maximum rendering depth (the default is 3)
and the values and colors for rendering. As with the previous algorithm view, if you click the
Advanced button on this page, you are presented with the ability to further customize the
node formatting for this algorithm.

On the last page, as with the previous model, you are presented with the list of tasks to be
performed and their status as the wizard executes them. After all steps are completed, the
model is rendered onto the blank Visio page. The output displayed shows the decision tree
selected, and like any Visio output can be further formatted to your liking.

Figure 24-32 The Decision Tree view requires that you select a particular tree from a Decision Tree source mining model.

The last type of included view is Cluster. To see how this works, you create a third new page
and then drag the Cluster shape onto that page. This opens the Cluster Wizard. Confirm your
connection to SSAS, and on the next page, choose an appropriate mining model. On the
fourth wizard page (shown in Figure 24-33), you are presented with a set of display options
specific to this view. The default is to display the cluster shapes only. As an alternative to the
default display you can choose to show the cluster characteristics or the discrimination chart.
Just as with the other wizards, you can click the Advanced button to further define the for-
mat of the objects Visio displays.

Figure 24-33 The Cluster Wizard allows you to specify the output view type.

The last page of this wizard indicates the steps and completion status. After processing com-
pletes, the output is displayed on the Visio page. For our example, we chose Show Clusters
With Characteristics Chart. The output is shown in Figure 24-34.

Figure 24-34 The Cluster view in Visio

As we end our brief tour of the included data mining shapes in Visio, we remind you that not
only can you produce highly customized visualization results of your processed data mining
models, but you can also use VSTO to further customize the results from Visio.

Client Visualization
As we complete this chapter on using Office 2007 as a client interface for SSAS 2008 data
mining, we’d like to mention a few more current and future possibilities for effectively visu-
alizing the results of data mining models. We do this so that you can be inspired to think
creatively about how you can solve the important problem of visualization for your particular
BI project. We also hope to get application developers in general thinking more about this
challenge. As better visualization technologies are released, such as WPF and Silverlight from
Microsoft as well as products from other vendors, we do believe that more creative, effective,
and elegant solutions will be in high demand.

In the short term, remember that the data viewers included with BIDS, SSMS, and Excel
are available for application developers to embed in custom applications. These embed-
dable controls are downloadable from http://www.sqlserverdatamining.com/ssdm/Home/
Downloads/tabid/60/Default.aspx. Advanced control developers can access the source code
from these controls and extend it, or they can create their own controls from scratch.

It is also interesting to note that new types of development environments are being created
to support new types of visualization. Microsoft’s Expression suite is a set of tools aimed at
visual authoring for WPF and Silverlight applications. Microsoft’s Popfly is a visual program-
ming environment available online that is itself created using Silverlight. Figure 24-35 shows a
Popfly mashup (application). We can see possibilities for combining the input from data min-
ing models along with advanced (and new) types of visualization controls in both traditional
applications and mashups. Access to Popfly (http://www.popfly.com) is free and mashups can
be embedded into applications.

Figure 24-35 Popfly is a Web-based integrated development environment that allows developers to create "mashed-up" visualizations.

In addition to these options, we think it’s quite exciting to consider the future of data mining
visualization controls. A great place to look to understand what this future holds is Microsoft
Research (MSR), which has an entire division devoted to the problem of effective data visu-
alization. They have many interesting projects available to review, which you can find on
their main Web site at http://research.microsoft.com/vibe/. The FacetMap is one example of
an interesting visualization. Take a look at http://research.microsoft.com/vibe/projects/Facet-
Map.aspx. Many of the enhancements in SQL Server 2008 data mining originated directly
from work done at MSR. You can download some data visualization controls that have been
developed by MSR from http://research.microsoft.com/research/downloads/Details/dda33e92-
f0e8-4961-baaa-98160a006c27/Details.aspx.

Data Mining in the Cloud


During the writing of this book, Microsoft previewed Internet-hosted data mining. In August
2008, Microsoft first showed an online sample showcase. The sample includes a subset of the
Table Analysis Tools included on the Excel 2007 Ribbon, as shown in Figure 24-36.

Figure 24-36 Data mining in the cloud is now available online!



The sample is available at http://www.sqlserverdatamining.com/cloud/. To test this sample,
click the Try It Out In Your Browser button, and then click the Load Data button. You can use
either data from the Adventure Works DW sample database or you can upload a .csv file to
analyze. The loaded data is then displayed on the Data tab of the online application. Next
you select the type of analysis to be performed by clicking one of the toolbar buttons, such
as Analyze Key Influencers. You then configure any parameters required by the algorithms.
Following our example using Analyze Key Influencers, select the target column. The output is
displayed on the Analysis Results tab. As of this writing, cloud-hosted data mining is in early
preview stages only. Not all features available in the Excel data mining client are supported in
this online preview. Pricing has not yet been announced.

Summary
In this chapter we investigated the integration between SQL Server 2008 data mining and
Office 2007 using the free Data Mining Add-ins. Specifically, we addressed initial configu-
ration and then went on to look in depth at the Excel 2007 Table Tools Analyze and Data
Mining tabs. We then looked at the Visio 2007 data mining template. We concluded by tak-
ing a brief look at other client tools and a peek at the future of data visualization.

We hope we’ve conveyed our excitement about the possibilities of using data mining in your
current BI projects resulting from the power and usability of the end-user tools included in
Office 2007.
Chapter 25
SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007
The release of Microsoft Office SharePoint Server 2007 has meant many things to many
people. To some, it was a development platform that embraced the proven techniques of
ASP.NET 2.0; to some, it was a workflow and business process automation engine. To others,
it was a content management system to manage and surface important content within an
organization.

An integral piece of the Office SharePoint Server 2007 pie was the business intelligence (BI)
integration slice. Although Microsoft has included rich BI capabilities in its SQL Server line of
products for the past several releases of SQL Server, the integration with SharePoint Server
prior to the 2007 release was very rudimentary.

With Office SharePoint Server 2007, you can now surface BI capabilities for information work-
ers using a familiar interface. We look specifically at the integration between SQL Server
Analysis Services (SSAS) and Office SharePoint Server 2007 in this chapter. This particular
section builds on our discussions in Chapters 20 through 24 regarding SQL Server Reporting
Services (SSRS) and Microsoft Office Excel 2007 integration.

In this chapter, we focus on two specific BI capabilities that Office SharePoint Server 2007
brings to the table. The first feature is Excel Services, which refers to the ability of Office
SharePoint Server 2007 to enable business users to apply, in a SharePoint Web-based portal,
the Excel skills they have developed over the years.

The second set of features is related to SQL Server Reporting Services integration with Office
SharePoint Server 2007. These features allow solution providers to surface automated SQL
Server reporting information from within an Office SharePoint Server portal where business
users do their work.

Excel Services
We’ve emphasized that you should select a client tool that’s easy for you to use, and that this
choice is critical to the success of your BI project. As you saw in Chapter 24, “Microsoft Office
2007 as a Data Mining Client,” end users often prefer to start the BI process using data stored
in Excel. For many years, business users have been using Excel as their own personal data
repository, and they’ll probably continue to do so.

Although maintaining local data might seem contrary to the goals of an enterprise BI proj-
ect, we’ve found a close correlation between Excel use prior to a BI project and the adoption
rate for Excel or Excel Services as a client to BI data stored in OLAP cubes and data mining
structures after the implementation of BI projects. As capable as Excel is, it’s still a thick-
client, desktop-limited application. The workbooks that you or your end users author live on
their desks. When someone asks for a new version, you e-mail your workbook to them. As
you might have guessed (or experienced!), this ends up creating major versioning problems.
These versioning problems span both data and logic, and they can include other challenges
such as inappropriate sharing of embedded information, such as SQL Server connection
strings (that might include passwords) or logic that is a part of Visual Studio Tools for Office
(VSTO) dynamic-link libraries (DLLs) within your Excel sheet. Thus, you’ve probably found it
difficult over the years to share all the valuable insight that you have built into your Excel
sheets.

Excel Services addresses these problems. Excel Services is part of the Microsoft Office
SharePoint Server technology stack (Enterprise edition only). Excel Services makes it simple
to use, share, secure, and manage Excel 2007 workbooks (.xlsx and .xlsb file formats only) as
interactive reports that can be delivered to authorized end users either through the Web
browser or by using Web services in a consistent manner throughout the enterprise. Be sure
to keep in mind that only Microsoft Office SharePoint Server 2007 Enterprise edition can be
used to leverage Excel Services as a BI client for your solution.

To summarize, here are the important points to remember:

■■ Excel Services is available only with Excel 2007 and the new Excel 2007 formats.
■■ Excel Services will not support Excel sheets with embedded logic, such as macros,
ActiveX controls, or VSTO DLLs. Only .xlsx and .xlsb files are rendered in Excel Services.
■■ Excel Services does not give you a server-side version of Excel. Instead, it gives you the
capability of sharing an Excel sheet as an interactive report.
■■ This interactive report, along with all its logic, is exposed to the enterprise either
through the Web browser or Web services.
■■ Charts are rendered as static images, so interactivity for SSAS purposes is limited unless
you use a PivotTable.

Basic Architecture of Excel Services


The diagram shown in Figure 25-1 illustrates the major components of Excel Services.

From Figure 25-1, you can tell that at the heart of Excel Services is Excel Calculation Services.
Excel Calculation Services is the part of Excel Services that runs on the application server (or
the server running Office SharePoint Server 2007), and it has the responsibility of loading
workbooks, calculating workbooks, calling custom code as user-defined functions, and
refreshing external data. In addition, the Excel Calculation Services piece is also responsible
for maintaining session information for the same workbook and the same caller.

Figure 25-1 Excel Services architecture (SharePoint Web front end: Excel Web Access and Excel Web Services; SharePoint application server: User-Defined Functions and Excel Calculation Services; SharePoint content database: Excel Workbooks; external data sources such as SQL Server)

Excel Calculation Services is also responsible for caching open Excel workbooks, calculation
states, or external data query results. Because of this caching responsibility, when either of
the front-end pieces—the Web service or Web Parts—is requested to render an Excel sheet,
it queries Excel Calculation Services to provide the information it needs to render. Excel
Calculation Services then either presents a cached version of the Excel sheet or loads it as
necessary.

When Excel Calculation Services needs to load an Excel sheet, it queries the metadata for the
selected sheet from the Office SharePoint Server 2007 content database and then starts pro-
cessing it.

The loaded workbook might have external data connections, in which case Excel Calculation
Services queries the external data source and refreshes the Excel sheet data accordingly. Or
the loaded Excel sheet might extend its calculation capabilities by using user-defined
functions (UDFs) written in .NET. Excel Calculation Services then locates, loads, and calls
those UDFs as necessary.
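
To make the UDF mechanism concrete, the following is a minimal sketch of what such a .NET UDF class might look like. The class, method, and formula names are hypothetical, and the compiled assembly must also be registered as a trusted user-defined function assembly in the SSP before Excel Calculation Services will load it:

using Microsoft.Office.Excel.Server.Udf;

[UdfClass]
public class ExpenseUdfs
{
    // Callable from a cell formula in the published workbook, for example =ToUsDollars(B3, 1.27).
    [UdfMethod]
    public double ToUsDollars(double amount, double exchangeRate)
    {
        return amount * exchangeRate;
    }
}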

When Excel Calculation Services is done processing the Excel workbook, it hands over the
information as requested to the Web service or Web Parts, whichever the case may be.

Immutability of Excel Sheets


You need to understand this very important fact: Excel Calculation Services does not give you
the ability to edit an Excel sheet in this release. It merely lets you view the Excel sheet in its
current form through either a Web service or Web browser.

As you’ll see later in this chapter, you do have the ability to change the values of certain pre-
defined cells as parameters during your session. That ability allows you to view a new snap-
shot of the data without having to change the Excel sheet. You also have the ability to export
that snapshot using the Web Parts UI. What you cannot do is allow Excel Services to save that
snapshot back in the document library.

Although you might be wondering about the usefulness of this feature set because of its
immutability limitation, we’ve actually found this restriction to be acceptable because the
core business requirement to be able to expose an Excel sheet through the browser is often
equivalent to the requirement of providing “one version of the truth.” If any user connecting
to the document library using a Web browser is able to edit the sheet, the sheet’s credibility
becomes questionable.

Of course, the read-only limitation does not meet the needs of all BI scenarios. However, we
find that it does have its place for some clients—usually for clients who want to present a
quick view of one or more business metrics to a large number of users.

Introductory Sample Excel Services Worksheet


We’ll walk you through a couple of examples so that you can better understand the core
functionality of Excel Services. In these examples, we’re intentionally focusing on the func-
tionality of Excel Services itself, rather than trying to showcase Excel Services being used with
SSAS data. To that end, we’ll just use relational source data, rather than using multidimen-
sional or data mining structures as source data. You can, of course, use any data source type
that is supported in Excel as a data source for Excel Services.

We’re working with a basic server farm running Office SharePoint Server 2007 that includes
the following:

■■ A Shared Service Provider (SSP) has been set up on port 4000.


■■ A central administration site has been provisioned on port 40000.
■■ A front-end Web site has been provisioned on port 80.
■■ Additions to My Sites have been configured to be created on port 4001.

The preceding port numbers and components, such as My Sites, are not necessary for Excel
Services. For this example, the full version of Office SharePoint Server 2007, rather than
Windows SharePoint Services, is required. This is because only Office SharePoint Server 2007
contains the MySites template. Also, only Office SharePoint Server 2007 contains the SSP
container for system settings. And, as mentioned, only Office SharePoint Server 2007 (and
not Windows SharePoint Services) contains the Excel Services feature set itself.

In addition to the details listed, we have created a site collection on port 80 using the blank
site collection.

The aim of this exercise is to author a simple Excel sheet and render it through Excel Services.
The following steps lead you through this exercise:

1. Create a place in the front-end site to store Excel sheets. This is as simple as creating a
document library that holds Excel 2007 sheets. The document library is called sheets.
2. Configure the SSP to allow a certain document library to be used in Excel Services.
3. Go to the SSP for your server or farm.
4. Locate the Excel Services Settings section, and click Trusted File Locations.
5. Click Add Trusted File Location.
6. Type http://<yourservername>/sheets as the address of the document library that
will hold your Excel sheets. Also, indicate that you intend to trust it to be used with
Excel Services.
7. Leave the rest of the default settings as they are, and click OK.
8. Author and publish an Excel sheet that will get rendered in Excel Services.
9. Start Excel 2007, and author a sheet as shown in Figure 25-2. Note that cell B5 is a
formula.

Figure 25-2 A sample Excel sheet

10. Click the Microsoft Office Button, and then click Publish in the left pane. This opens the
Publish submenu. From there, click Excel Services to publish your workbook to Excel
Services, as shown in Figure 25-3.
If you’re unable to find an Excel Services submenu item on the Publish menu in your
Excel application, chances are that you’re not running Office Professional or Ultimate.
In that case, you can simply upload the Excel sheet to the document library. For certain
functions, such as parameterized Excel sheets, you need to have Office Professional or
Ultimate. For a comparison of Office edition features, see http://office.microsoft.com/
en-us/suites/FX101757671033.aspx.

Figure 25-3 Publishing to Excel Services

11. In the Save As dialog box that appears, type http://<yourservername> as the save
location, and save the sheet to the sheets document library you created earlier. Save
the sheet as MyExpenses.xlsx.
12. Next, you need to edit the front-end site so that it can display Excel sheets in the
browser using out-of-the-box Web Parts.
13. Before you can use the Excel Services Web Parts, you need to enable them. You do this
by enabling the Office SharePoint Server 2007 Enterprise Site Collection features in
the Site Collection Features section of your port 80 site collection in Office SharePoint
Server 2007 by choosing Site Actions, Site Settings, and then Site Features from the
main (or top) page of your portal.
14. Browse to the port 80 site collection you created earlier, and choose Site Actions, Edit
Page to edit the home page.
15. Click Add A Web Part in the left pane, and click to add the Excel Web Access Web Part.
The Web Part prompts you to select a workbook. Click the link shown in the prompt to
open the tool pane.

16. In the tool pane, locate the workbook text box, and enter the path to the
MyExpenses.xlsx workbook you uploaded earlier in your sheets document library.
You can also browse to it by clicking the button next to the text box.
17. Click OK, and then click the Exit Edit Mode link to return to the viewing mode.

You should now see the workbook rendered in the browser, as shown in Figure 25-4.

Figure 25-4 The Excel sheet running under Excel Services

In this first walkthrough, we listed the steps you need to take to publish a simple read-only
Excel workbook to Excel Services in Office SharePoint Server 2007. In addition to performing
simple publishing tasks, you can enable parameters. We’ll take a detailed look at how to do
that next.

Publishing Parameterized Excel Sheets


In BI projects, we find that a more common business requirement than publishing a simple
Excel workbook is the need to publish parameterized workbooks. This is easy to accomplish.

The first thing you need to do is edit your Excel sheet by adding a pie chart to it, as shown in
Figure 25-5. To quickly create a pie chart using source data from an open workbook, click on
any section (cell or group of cells) containing data or labels, click Insert on the Ribbon, and
then click the Pie (chart) button to create the type of pie chart you want to add.

Figure 25-5 Adding a pie chart to your Excel sheet

Next, let’s assume you want to enable users to edit the value of cell B3, which is the expense
for gas, and view an updated pie chart. To provide this capability,
you need to give B3 a defined name, such as GasExpense. To do this, select the B3 cell, click
the Formulas tab on the Ribbon, and click the Define Name button. Define a new name as
shown in Figure 25-6.

Figure 25-6 Defining a name for the cell that shows gas expense

Next, republish the Excel sheet as you did earlier (using the steps described in the previous
section), with one difference. This time, click the Excel Services Options button in the Save As
dialog box as shown in Figure 25-7.

Figure 25-7 The Excel Services Options button

In the Excel Services Options dialog box, click the Parameters tab, click the Add button, and
select GasExpense to add it as a parameter. When you publish the sheet to Excel Services and
render it in the browser, the end user will be able to change the value of the gas expense
parameter using the GasExpense text box found in the Parameters task pane, as shown in
Figure 25-8.

To verify the functionality, you simply enter a new value for GasExpense. (In this example, we
entered a lower expense of 20.) Then click the Apply button to refresh the pie chart.

As we mentioned earlier, this update does not affect the base Excel sheet stored in the docu-
ment library—that is immutable when using Excel Services. The user’s changes, including
his parameters, are lost as soon as he closes the browser or his session times out. If the user
wants to export a snapshot of his changes, he can do so through the Internet Explorer tool-
bar by choosing Open and then Open Snapshot on the Excel menu.

Click the Excel Services Options button to see what other options are available. You can
choose to publish the entire workbook, specific sheets, or even individual charts and ranges.

Figure 25-8 Editing Excel Services parameters

Before we leave the subject of Excel Services core functionality, we’ll remind you that Excel
supports SSAS OLAP cube source data. This, of course, can include calculations and key per-
formance indicators (KPIs). For example, one business scenario that we’ve been asked to sup-
port is a dashboard-like display of OLAP cube KPIs via Excel Services. We’ve also been asked
to simply implement Excel Services to centralize storage of key workbooks that had been
previously scattered across users’ desktop computers.
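
As a small illustration of what such a workbook might contain, the Excel 2007 cube functions can surface individual cube values and KPI components directly in cells, and these formulas continue to work when the workbook is rendered by Excel Services. The connection name (AdventureWorksSSAS) and the KPI name (Revenue) in the following formulas are assumptions, not part of the earlier walkthrough:

=CUBEVALUE("AdventureWorksSSAS", "[Measures].[Internet Sales Amount]", "[Date].[Calendar Year].&[2004]")
=CUBEVALUE("AdventureWorksSSAS", CUBEKPIMEMBER("AdventureWorksSSAS", "Revenue", 1))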

Excel Services: The Web Services API


As we discussed in Chapter 23 and Chapter 24, you can use VSTO to extend Excel program-
matically. You might wonder whether (and how) you can programmatically extend Excel
Services. To do this, you can work with the Excel Web Services API. This allows external appli-
cations to aggregate Excel Services information over standard ASMX calls.

To work with this API, you start by accessing the Web service endpoint. The ASMX endpoint
is exposed at http://<your_server>/_vti_bin/ExcelService.asmx. ASMX endpoints are classic Web
services that can be consumed from Windows Communication Foundation (WCF) clients by using
the default basicHttpBinding.

Note If you’re unfamiliar with calling WCF services from a client application, see the walk-
through on MSDN titled “How to: Create a Windows Communication Foundation Client” at
http://msdn.microsoft.com/en-us/library/ms733133.aspx.
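
How you generate the proxy is up to you; as one sketch, the ExcelService proxy namespace used in the following code could be produced from the WSDL by running svcutil.exe (the namespace mapping and output file names here are our own choices, not product defaults):

svcutil.exe http://<your_server>/_vti_bin/ExcelService.asmx?wsdl /namespace:*,ExcelService /out:ExcelServiceProxy.cs /config:app.config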

As an example, the Excel sheet mentioned earlier in this section can be calculated on the
server side, and a WCF proxy can be created and used as follows:

ExcelService.ExcelServiceSoapClient client = new ExcelService.ExcelServiceSoapClient();

// Allow the caller's Windows identity to be impersonated by Excel Services.
client.ClientCredentials.Windows.AllowedImpersonationLevel =
    System.Security.Principal.TokenImpersonationLevel.Impersonation;

// Open the published workbook, recalculate a range on Sheet1, and read a single cell value.
ExcelService.Status[] outStatus;
string sessionID =
    client.OpenWorkbook(docLibAddress + excelFileName, "en-US", "en-US", out outStatus);

ExcelService.RangeCoordinates rc = new ExcelService.RangeCoordinates();
rc.Row = 0; rc.Column = 0; rc.Height = 10; rc.Width = 10;   // range to recalculate

outStatus = client.Calculate(sessionID, "Sheet1", rc);
object o = client.GetCell(sessionID, "Sheet1", 5, 2, false, out outStatus);

In the preceding code snippet, docLibAddress is a string that contains the path to the document
library that stores your Excel sheets, excelFileName is a string that is the actual file name, and
rc is an ExcelService.RangeCoordinates structure that identifies the range to recalculate.
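
As a further sketch, the same session could drive the parameterized workbook shown earlier in this chapter. The sheet name, the B5 cell address, and the value 20 below are assumptions tied to the MyExpenses example, and GasExpense is the defined name created earlier:

// These changes exist only within the current session; the workbook stored in the
// document library is not modified.
outStatus = client.SetCellA1(sessionID, "Sheet1", "GasExpense", 20);
outStatus = client.CalculateA1(sessionID, "Sheet1", "B5");
object total = client.GetCellA1(sessionID, "Sheet1", "B5", true, out outStatus);

// Release the server-side session when you are finished with it.
outStatus = client.CloseWorkbook(sessionID);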

The important thing to remember here is that with Web services in .NET 2.0, by default your
Windows identity was propagated all the way to the server running the Web service. Office
SharePoint Server 2007 expects to see your Windows identity when calling a Web service.

When using WCF, however, security is configurable, and it’s rightfully made anonymous
by default. If your business requirements are such that you’d like to revert from WCF-style
authentication to ASMX-style implicit Windows authentication, you have to add the following
security section to the basicHttpBinding configuration in your application's configuration file
(usually named app.config):

<security mode="TransportCredentialOnly">
  <transport clientCredentialType="Ntlm" proxyCredentialType="Ntlm" realm="" />
  <message clientCredentialType="UserName" algorithmSuite="Default" />
</security>

A Real-World Excel Services Example


So far, we’ve walked you through simple examples that served well in explaining the basics of
Excel Services. With the basics behind us, we’ll next work through a more complex example that
targets the Northwind database and presents an interactive PivotTable using infrastructure
provided by Excel Services. Note that you can also
connect to SSAS data by clicking From Other Sources on the Data tab of the Ribbon in Excel
and then clicking From Analysis Services in the drop-down list.

Note You can download the Northwind sample database from http://www.microsoft.com/downloads/details.aspx?FamilyID=06616212-0356-46A0-8DA2-EEBC53A68034&displaylang=en.

You can use the following steps to walk through this exercise:

1. Start Excel 2007, and create a new workbook.


2. Click the Data tab on the Ribbon.
3. Click From Other Sources and then click From SQL Server in the drop-down list, as
shown in Figure 25-9.

Figure 25-9 Excel data sources

After you click From SQL Server, the Data Connection Wizard opens. The Connect To Data
Source page asks you for the server name and login credentials. On the next page of the wiz-
ard, select the sample database and table. We are using the Northwind database, and specifi-
cally the Orders table. On the last page of the Data Connection Wizard, you need to fill in the
.odc file name and, optionally, a description, a friendly name, and search keywords.

To make the data connection easily shareable, save the .odc file in a Data Connection docu-
ment library on the Office SharePoint Server 2007 front-end port 80 site collection. In our
example, we saved the results in /dataconnections/NorthwindOrders.odc. To save the .odc in
this location, click the Browse button in the File Name section of the Save Data Connection
File And Finish page of the Data Connection Wizard. This opens the File Save dialog box,
where you’ll select the Office SharePoint Server 2007 Data Connection library location from
the list on the left. If this location does not appear, you’ll have to manually enter it.

Note Office SharePoint Server 2007 includes a Data Connection document library as one of the
default templates in the Report Center group of templates.
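
For reference, the connection information that the wizard captures in the .odc file is an OLE DB connection string; for this walkthrough it might resemble the following (the exact provider and the server name depend on your environment):

Provider=SQLOLEDB.1;Integrated Security=SSPI;Initial Catalog=Northwind;Data Source=<yourservername>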

After you save the data connection file, you will be prompted to import the data into your
Excel sheet and the Import Data dialog box will open. It is set by default to import your data
into Excel in a tabular (table) format. Choose the PivotTable Report option to import the data
as a PivotTable, as shown in Figure 25-10. Then click OK.

Figure 25-10 Importing information into Excel

After you click OK, the PivotTable Field List opens. There you will be presented with a list
of fields that you can add to your PivotTable by clicking the box next to their name. At the
bottom of the PivotTable Field List, there are four areas that you can use to lay out your
PivotTable data.

Next, choose to format your PivotTable report as shown in Figure 25-11 by clicking on fields
in the Choose Fields To Add To Report section of the PivotTable Field List and dragging those
fields to one of the four areas at the bottom of the PivotTable Field List window: Report Filter,
Column Labels, Row Labels, or Values. We’ve used ShipCountry as a report filter, ShipRegion
as a row label, and Count Of ShippedDate as the displayed data aggregate value.
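
To clarify what this layout computes, once the ShipCountry report filter is set to USA (as we do in a moment), the PivotTable summary is roughly equivalent to the following query against the Northwind Orders table. Excel generates its own query behind the connection, so this is shown only for illustration:

Select ShipRegion, Count(ShippedDate) As CountOfShippedDate
From Orders
Where ShipCountry = 'USA'
Group By ShipRegion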

Figure 25-11 Setting up your PivotTable

After you’ve set a filter on the ShipCountry value to USA, publish the Excel sheet to Excel
Services. The rendered sheet will look like Figure 25-12.

Figure 25-12 The PivotTable in a browser

You can verify that you have PivotTable interactivity built right into the Web-based inter-
face after you publish your report to Office SharePoint Server 2007 Excel Services and view
the page where you host Excel Services Web Parts from Office SharePoint Server 2007 in a
browser. There, you’ll be able to set filters on the columns, rows, and the ShipCountry filter
variable.

As we have shown, the ability to host Excel-sourced reports in a Web-based UI, with back-end
data from SSAS or other data sources, is quite powerful. You will note that we did not have to
write a single line of code to accomplish this task. Note also that
the Excel Services interface is exposed as a Web service, so if you want to extend its capa-
bilities programmatically you can do so by working with the publicly exposed methods of
its API.

SQL Server Reporting Services with Office SharePoint Server 2007
So far in this chapter, we’ve talked about Excel Services being used as a BI tool with Office
SharePoint Server 2007. Although you can certainly use Excel Services as your BI user por-
tal, the limits of Excel might not match your business requirements. Excel Services allows
the business user to achieve simple tasks and manage simple sets of data. In spite of Excel’s
ability to connect to external data sources, Excel Services targets one specific type of client
only—users who are already comfortable working in some version of Excel. We see this user
population typically as business analysts. Of course, there are exceptions to this characteriza-
tion; however, we most often use Excel or Excel Services as one part of our client interfaces
for BI projects.

As we saw in Chapters 20 through 22, SSRS is a sophisticated and powerful end-user client
BI tool. Office SharePoint Server 2007 can integrate very closely with SSRS and then render
reports that you create in SSRS inside the information worker’s work portal.

Configuring SQL Server Reporting Services with Office SharePoint Server 2007
SSRS can work with Office SharePoint Server 2007 in two configuration modes: native mode
or SharePoint integrated mode. In either mode, the SSRS reports are authored using the
Business Intelligence Development Studio (BIDS) SSRS template or Report Builder. They are
then rendered in the browser inside an Office SharePoint Server 2007 UI. It is important that
you understand a bit more about the implications of using these two configuration modes as
you consider using SSRS in Office SharePoint Server 2007 as a client for your BI solution.

The major difference between native mode and SharePoint integrated mode is that
SharePoint integrated mode lets you deploy and manage both the reports and the relevant
data connections in SharePoint document libraries. This reduces the administrative overhead
for your solution.

If you choose to use SharePoint integrated mode, you first need to configure SSRS to use this
mode rather than the default configuration for SSRS, which is to use native mode. Also, to use
SharePoint integrated mode, SSRS must be installed on the same physical machine as Office
SharePoint Server 2007. To set up SSRS in SharePoint integrated mode, on the Start menu,
click the Reporting Services Configuration Manager link. Navigate to the Web Service URL
section and set up a virtual directory called ReportServer on a port other than what Office
SharePoint Server 2007 is using. This is important because SQL Server 2008 does not use
Internet Information Services (IIS) and Office SharePoint Server 2007 cannot natively share a
port.

Note By using stsadm.exe –exclude, you can configure Office SharePoint Server 2007 to share
(or exclude) specific URLs, such as that of an SSRS instance. As a best practice, we generally use
separate ports when we host both SSRS and Office SharePoint Server 2007 on the same server.

Next, in the Database section, create a new database with a unique name and then select the
desired mode: either native or SharePoint integrated. When both Office SharePoint
Server 2007 and SSRS have been selected as client tools for a BI project, we use SharePoint
integrated mode more often than native mode because of the simplified administration it
provides. This simplification also includes fewer metadata tables for SSRS. In SharePoint inte-
grated mode, SSRS stores its own metadata in tables inside the configured Office SharePoint
Server 2007 metadata databases rather than in unique databases for SSRS.

The next step is to create a Report Manager URL. We created http://<yourservername>:10000/Reports
for our example. It might not be obvious given the UI, but a Report
Manager virtual directory is not created for you by default. This is indicated by the message
shown in Figure 25-13. The reason for this is to give you greater control over the particular
URL that is being used for SSRS. After you’ve entered your desired URL, you must click Apply
to create the Report Manager instance at this location. This last step is optional; however, we
usually implement SSRS and Office SharePoint Server 2007 integration using this indepen-
dent URL option for SSRS.

Figure 25-13 Configuring the Report Manager URL

In the next step, you create a simple report and deploy it so that you can see it displayed in
Office SharePoint Server 2007.

Authoring and Deploying a Report


To create a sample report, open BIDS and author a new Report Server project named
MyReports under the Business Intelligence Projects category, as shown in Figure 25-14.

In the Shared Data Sources section, add a new data source called Northwind, and use it to
connect to the Northwind database. Next, in the Reports section, right-click and choose Add,
New Item, Report. On the report designer surface, drag and drop a Table data region from
the Toolbox to the report surface. When prompted to provide a data source, click the link,
and then choose the existing Northwind data source as shown in Figure 25-15. Click OK.

Figure 25-14 Creating a Report Server project

Figure 25-15 Picking the proper data source



Next you need to get some data to display on your report. To do this, type a sample query
(targeting the Customers table). This creates the necessary dataset. Then format the results to
create a report as shown in Figure 25-16. For our example, we used the following query:

Select CustomerID, ContactName, Address, City, PostalCode, Country
From Customers

Figure 25-16 The report in design mode

For the last step, you need to specify the deployment settings.

If you’re using native mode, in the project settings dialog box, specify the target server URL
as http://<yourservername>:10000/reportServer.

If you’re using SharePoint integrated mode, you need to specify these values (illustrative settings follow the list):

■■ The target data source folder as a Data Connection document library on your
SharePoint site
■■ The target report folder as another document library on your SharePoint site
■■ The target server URL as the SharePoint site itself
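
As a purely hypothetical illustration, for a SharePoint site at http://yourservername with document libraries named Data Connections and Reports, the corresponding Report Server project properties might look like this:

TargetDataSourceFolder: http://yourservername/Data Connections
TargetReportFolder: http://yourservername/Reports
TargetServerURL: http://yourservername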

With the project settings established, go ahead and deploy the project by right-clicking the
project name and clicking Deploy on the shortcut menu.

Using the Report in Office SharePoint Server 2007: Native Mode
If you want to use the report in native mode, accept the default SSRS configuration settings.
For both SharePoint integrated mode and native mode, you’ll want to display the report on a
SharePoint Web page.

To do this, you have several options. The simplest option is to use the Web Parts that are
designed to host SSRS reports. These Web Parts are shipped with SQL Server and copied locally
when Reporting Services is installed, but they are not enabled on a SharePoint site by default.

To make these Web Parts available for a particular SharePoint site, you must first enable
the SSRS Web Parts for Office SharePoint Server 2007. These Web Parts are named Report
Explorer and Report Viewer. Report Explorer returns a list of reports at a particular SSRS
directory location. Report Viewer hosts the SSRS standard report viewer control in an
I-frame-like container.

To make these Web Parts available for authorized end users to add to pages on a SharePoint
site, you (as an authorized site administrator) must run the following command line:

stsadm.exe -o addwppack -filename "C:\Program Files\Microsoft SQL Server\100\Tools\Reporting Services\SharePoint\RSWebParts.cab"

This command activates the Web Parts. The Web Parts are included with a standard SQL
Server 2008 installation, but they are not activated by default with the Enterprise edition of
Office SharePoint Server 2007. End users must have appropriate permission to install (or add)
any Web Parts to selected SharePoint Web site pages. Also, all Office SharePoint Server 2007
Web Parts require Code Access Security (CAS) permissions for activation. You can adjust the
default settings for the CAS permissions depending on the functionality included in the SSRS
reports that you plan to host on your Office SharePoint Server 2007 instance.

Note CAS in Office SharePoint Server 2007 is expressed as a collection of CAS permissions. Each
collection contains multiple individual permissions that are assigned to DLLs that meet the crite-
ria—that is, they have a certain name, version, and so on. Permission sets include Full Trust and
lesser (more granular) permissions as well. You should refrain from adding Web Parts that have
the Full Trust permission set associated with them because they can pose a security risk.

After you’ve made the SSRS Web Parts available as part of your Office SharePoint Server
2007 instance, you need to add them to a page on your portal. To do this, browse to your
SharePoint Web site, and then put the selected page into edit mode. Next, add a Web Part to
a selected area of the editable page. Select the Report Viewer Web Part from the Add Web
Parts dialog box as shown in Figure 25-17. Note that this dialog box also includes a Report
Explorer Web Part.

For the last step, you need to configure the connection from the Office SharePoint Server
2007 Report Viewer Web Part to your particular SSRS instance. To do this, you add informa-
tion to the properties of the Web Part. In the properties section of the Report Viewer Web
Part, change the Report Manager URL to http://<yourservername>:10000/reports and
the report path to exactly this: /Myreports/report1.

You should now be able to view the report when running Office SharePoint Server 2007.

Figure 25-17 Report Viewer Web Part

Using the Report in Office SharePoint Server 2007: SharePoint Integrated Mode
Before you can use SSRS in SharePoint integrated mode, you first need to install the
Microsoft SQL Server 2008 Reporting Services Add-in for Microsoft SharePoint Technologies
on all of the involved SharePoint Web front-end servers. SSRS can be installed on a single server, or its
components can be scaled to separate servers.

After installing the Reporting Services add-in on the server running Office SharePoint Server
2007 where the front-end SSRS components are installed, you need to activate the Reporting
Services Integration feature on both the Central Administration site of the Office SharePoint
Server 2007 instance and the front-end Web site where you want to use the reports.

In Central Administration, go to Application Management, and click Grant Database Access
under Reporting Services, as shown in Figure 25-18.

In providing the relevant accounts with the appropriate database access, Office SharePoint
Server 2007 prompts you to enter credentials for a user that has administrative rights on the
domain. Make sure to enter the user name in the Domain\Username format.

Figure 25-18 Granting database access in Central Administration

Next, under Central Administration\Application Management\Reporting Services, click
Manage Integration Settings. In the Reporting Services Integration window, provide the
settings as shown in Figure 25-19.

Figure 25-19 Configuring Office SharePoint Server 2007 with SQL Server Reporting Services Integration settings

Next, in your front-end site, after having deployed the reports and activated the Reporting
Services Integration feature, edit the home page and add the SQL Server Reporting Services
Report Viewer Web Part as shown in Figure 25-20.

You can now configure this Web Part and point it to the report that you deployed to the
document library earlier. You can then render the selected SSRS report on the SharePoint page
where you’ve added the Report Viewer Web Part.

Figure 25-20 Picking the SQL Server Reporting Services Report Viewer Web Part

Using the Report Center Templates


There is a set of template pages that ship with a default Enterprise installation of Office
SharePoint Server 2007. Each of these page types includes Web Parts that you might want
to use to display some of your BI reports for authorized end users. The top-level page for
this group is named Report Center. Report Center contains linked pages where you can store
content of the following types: data connections, reports, KPIs, and more.

Also, the top-level page includes custom menu items related to reports for end users. These
menus include notifications and more. If you intend to use Office SharePoint Server 2007 as
a host for BI reports implemented via SSRS, you might want to examine the built-in function-
ality in the Report Center templates to see whether it can meet any of your project’s busi-
ness needs. Report Center includes, for example, a specialized type of SharePoint document
library called Reports Library. This template contains functionality specific to hosting *.rdl-
based reports, such as integrated SSRS report uploading and display.

The Office SharePoint Server 2007 Report Center also contains templates to help you quickly
create BI dashboards. A BI dashboard commonly contains Web Parts that display various BI
metrics. These often include the display of KPIs (OLAP cube-based KPIs, Excel workbook-
based KPIs, or KPIs locally hosted in Office SharePoint Server 2007), reports, and Excel work-
books or charts.

Although we sometimes use these templates, in our real-world experience, we most often
create custom pages and add the included Report Viewer and Report Explorer controls. We
do this because most of our customers want to implement greater customization on their
portal in their BI end-user implementation than is provided by the Report Center templates.

For more information on these templates, see the SQL Server Books Online topic
“Understanding the Report Center and Dashboards in SharePoint 2007” at http://msdn.microsoft.com/en-us/library/bb966994.aspx.

PerformancePoint Server
Another Microsoft product that provides rich integration with SQL Server SSAS is
PerformancePoint Server (PPS). While in-depth coverage of PPS is beyond the scope of this
book, we often choose PPS as one of our client tools in BI solutions, and did provide some
coverage of writing MDX for it in Chapter 11, “Advanced MDX.” We encourage you to explore
online resources to investigate the integration capabilities PPS offers. A good starting point is
http://office.microsoft.com/en-us/performancepoint/FX101680481033.aspx.

Summary
Microsoft SharePoint technologies have enjoyed increasingly wide adoption. We find that
this is particularly driven by features added in the 2007 release. The SQL Server 2008 SSAS
BI integration capabilities available in Office SharePoint Server 2007 are seen by many of our
customers as a critical driver of adoption of the SharePoint platform in their enterprise. Office
SharePoint Server 2007 enables rich BI capabilities by using Excel Services or SQL Server
Reporting Services. It’s only reasonable to assume that you’ll see further investments in this
arena from both Microsoft and third-party vendors.
Index
A Agile Software Development.
See MSF (Microsoft Solution
ascmd.exe tool, 157
ASMX files, 732
AccessMode property, 449 Framework) for Agile Software assemblies
account intelligence, configuring in Development compiled, using with SSAS
Business Intelligence Wizard, Algorithm Parameters dialog box, objects, 196–197
243, 246–247 367, 376–377, 378, 710 custom, adding to SSRS reports,
AcquireConnections method, 581 algorithms, data mining 647–649
actions, SSAS association category, 359 custom, creating, 197
defined, 149, 233 classification category, 358 default, in SSAS, 197
drillthrough, 233, 236–238 clustering category, 359 association algorithms, 359
regular, 233, 234–235 configuring parameters, 367, 378 Association Wizard, 48
reporting, 233, 235–236 in data mining models, 45, 46, asynchronous data flow outputs,
Add SourceSafe Database Wizard, 46–47, 358 459
541–542 forecasting and regression asynchronous transformation,
AddRow method, 582 category, 359 583–586
administrative scripting, SSRS, Microsoft Association algorithm, attribute hierarchies, in OLAP cube
667–669 391–393 design, 206–207
ADO.NET connection manager, 473 Microsoft Clustering algorithm, attribute ordering, specifying in
ADO.NET data flow destination, 386–389 Business Intelligence Wizard,
485, 486 Microsoft Decision Trees 244, 250
ADO.NET data flow source, 483, algorithm, 381–383 attribute relationships, in BIDS, 139,
497–498, 597 Microsoft Linear Regression 205, 207–209, 223
Agent. See SQL Server Agent algorithm, 383 Audit transformation, 524
Aggregate data flow transformation, Microsoft Logistic Regression auditing. See also SQL Server
486, 487, 488 algorithm, 395–396 Profiler
Aggregation Design Wizard, Microsoft Naïve Bayes algorithm, added features in SQL Server
271–273 376–381, 518 2008, 111
aggregations Microsoft Neural Network using SQL Server Profiler, 109–110
Aggregation Design Wizard, algorithm, 394–395 authentication
271–273 Microsoft Sequence Clustering credential flow in SSRS reports,
built-in types, 147 algorithm, 389–390 103
configuring, 262–263 Microsoft Time Series algorithm, requesting access to Report
creating designs manually, 383–386 Server, 610–611
277–278 sequence analysis and prediction AverageOfChildren aggregate
defined, 9 category, 359 function, 147
and fact tables, 261 supervised vs. unsupervised, 376
implementing, 270–278 viewer types for, 369–370
key points, 271 ALTER MINING STRUCTURE (DMX) B
main reason to add, 271 syntax, 366 background processing, for reports
overview, 261–263 Analysis Management Objects and subscriptions, 612
and query processing, 262 (AMOs), 31 backups and restores
question of need for, 270–271 Analysis Services. See SQL Server overview, 106
role of SQL Server Profiler, Analysis Services (SSAS) for SQL Server Analysis Services,
275–277 Analysis Services Processing task, 106–107
in SQL Server cube store vs. 430, 530 for SQL Server Integration
Transact-SQL, 9 analytical activities. See OLAP Services, 107–108, 112
Usage-Based Optimization (online analytical processing) for SQL Server Reporting Services,
Wizard, 274–275 Ancestors MDX function, 319–320 108
using with date functions, Application class, 596, 599 Barnes and Noble, 28
324–326 applications, custom, integrating
viewing and refining, 262–263 SSIS packages in, 596–600


BI solutions. See also Business complete solution components, deploying SSIS packages, 553–556
Intelligence Development 50–54 Deployment Progress window, 41
Studio (BIDS) customizing data display in SQL development tips, 70
case studies, 27–33 Server 2008, 3 disconnected instances, 259–261
common challenges, 54–56 defined, 3 Error List window, 31
common terminology, 11–15 development productivity tips, 70 exploring dimension table data,
complete solution components, in law enforcement, 29 123
50–54 localization of data, 29 exploring fact table data, 120
customizing data display in SQL measuring solution ROIs, 56–58 MDX Query Designer, 628–631
Server 2008, 3 MSF project phases, 65–71 New Cube Wizard, 134
defined, 3 multiple servers for solutions, 4 OLAP cubes, adding capabilities,
development productivity tips, 70 and Office SharePoint Server, 225–255
in law enforcement, 29 723–745 OLAP cubes, using to design,
localization of data, 29 process and people issues, 61–83 183–223
measuring solution ROIs, 56–58 project implementation scope, 28 online vs. offline mode, 184–186
MSF project phases, 65–71 query language options, 23–25 opening sample databases, 39–43
multiple servers for solutions, 4 relational and non-relational data overview, 183–186
and Office SharePoint Server, sources, 22–23 as primary development tool for
723–745 reporting interfaces, 3 SSIS packages, 20, 439–440,
process and people issues, 61–83 role of Microsoft Excel, 36–37, 463–495
project implementation scope, 28 43–50 processing options for cubes and
query language options, 23–25 sales and marketing, 29 dimensions, 287–291
relational and non-relational data schema-first vs. data-first relationship to Visual Studio,
sources, 22–23 approaches to design phase, 16–17, 22, 41, 463
reporting interfaces, 3 130 Report Data window, 635–638
role of Microsoft Excel, 36–37, security requirements for resemblance to Visual Studio
43–50 solutions, 95–106 interface, 157
sales and marketing, 29 skills necessary for projects, 72–76 role designer, 195–196
schema-first vs. data-first software life cycle, 28 running on x64 systems, 91
approaches to design phase, solution core components, 16–20 Solution Explorer window, 40, 46,
130 solution optional components, 184, 186–188
security requirements for 21–23 source control considerations, 113
solutions, 95–106 testing project results, 70–71 SSRS Toolbox, 621–622, 638
skills necessary for projects, 72–76 top 10 scoping questions, 30 working with SSAS databases in
software life cycle, 28 visualizing solutions, 34–36 connected mode, 261
solution core components, 16–20 Business Intelligence Development working with two instances open,
solution optional components, Studio (BIDS). See also SQL 225
21–23 Server Analysis Services (SSAS) Business Intelligence Wizard
testing project results, 70–71 BIDS Helper tool, 255, 490, 494, accessing, 243
top 10 scoping questions, 30 510 Create A Custom Member
visualizing solutions, 34–36 compared with Visio for creating Formula, 244, 251
BIDS. See Business Intelligence OLAP models, 133 Define Account Intelligence, 243,
Development Studio (BIDS) as core tool for developing 246–247
BIDS Helper tool, 255, 490, 494, 510 OLAP cubes and data mining Define Currency Conversion, 244,
Biztalk Server, 22 structures, 16, 40, 157 251–254
Boolean data type, 363 creating new SSIS project Define Dimension Intelligence,
BottomCount MDX statement, 311 templates by using New Project 243, 250
breakpoints, inserting, 505–506 dialog box, 464–465 Define Semiadditive Behavior,
build, defined, 259 creating or updating SSAS objects, 244, 250
building phase, MSF, 68–70 186–188 Define Time Intelligence, 243, 245
business intelligence (BI). See creating reports, 612–622 Specify A Unary Operator, 244,
also Business Intelligence creating SSIS packages, 463–495 248–250
Development Studio (BIDS) data mining interface, 360–375 Specify Attribute Ordering, 244,
case studies, 27–33 defined, 155 250
common challenges, 54–56 Dependency Network view, 47 ByAccount aggregate function, 147
common terminology, 11–15 deploying reports to SSRS, 624

C columns
in dimension tables, 121–122
constraints. See precedence
constraints
cache scopes, for queries, 326 in fact tables, 118–119, 146 containers
CacheMode property, 364 variable-width, in data flow default error handling behavior,
Calculated Columns sample metadata, 456–457 499
package, 487 command-line tools generic group, 479
calculated measures, 148 ascmd.exe tool, 157 SSIS control flow, 478–479
calculated members DTEXEC utility, 440 content types
creating in Business Intelligence DTEXECUI utility, 440–441 Continuous, 362
Wizard, 318, 320 DTUTIL utility, 441 Cyclical, 362
creating in cube designer, installed with SQL Server 2008, defined, 361
239–241 157 detecting in Data Mining Wizard,
creating in query designer, 631 rsconfig.exe tool, 604 402
creating using WITH MEMBER rs.exe tool, 609 Discrete, 361
statement, 307 SQLPS.exe tool, 157 Discretized, 362
defined, 175, 307 CommandText property, 598 Key, 362
global vs. local, 631 community technology preview Key Sequence, 362, 363
permanent, creating using BIDS (CTP) version, SQL Server 2008, Key Time, 362, 363
interface, 334–335 40 Ordered, 362
permanent, creating using MDX ComponentMetaData property, 580 support for data types, 363
scripts, 335–336 components. See also Script Table, 362
pros and cons, 241 component Continuous content type, 362
vs. stored measures, 298 compared with tasks, 444, control flow designer
Calculations tab, cube designer, 201, 567–568 Connection Manager window in,
239–242, 334–335 custom, in SSIS, 587–588 468
Capability Maturity Model destination, 485–486, 586–587 Data Flow task, 476, 477–478
Integration (CMMI), 65 in SSIS package data flows, 444 Data Profiling task, 476
CAS (code access security), 648–649 transformation, 486–488 defined, 468
Cash Transform data flow Configuration Manager, Reporting event handling, 500–501
transformation, 486 Services, 102, 108, 155, 607, Execute Process sample, 476–478
change data capture (CDC), 524, 531 609, 737 Execute Process task, 476, 477
Chart control, 638, 643 Configuration Manager, SQL Server, Execute SQL tasks, 476, 476–477,
checkpoints, in SSIS packages 94, 155, 157–158 494
configuring, 506 Configuration Manager, SSRS. Foreach Loop containers, 476, 478
defined, 506 See Configuration Manager, For Loop containers, 478
writing, 507 Reporting Services Sequence containers, 478
child tables, relationship to parent confusion matrix. See classification Task Host containers, 478
table, 5 matrix task overview, 476–478
Children MDX function, 300, 316, connection managers Toolbox window in, 469
321 adding to packages, 473 control flow, in SSIS packages
Choose Toolbox Items dialog box, ADO.NET, 473 building custom tasks, 591–593
591 custom, 588, 594 configuring task precedence,
classification algorithms, 358 defined, 468 480–481
classification matrix, 415–416 Flat File, 474 container types, 478–479
Clean Data Wizard, 705, 706–707 inclusion in Visual Studio package Data Profiling task, 510–513
cloud-hosted data mining, 720–721 designers, 468 event handling, 450–451
Cluster DMX function, 425 ODBC, 473 logging events, 504
Cluster Wizard, Microsoft Visio, OLE DB, 473 Lookup sample, 528
717–718 overview, 448–450 overview, 442–444
ClusterDistance DMX function, 425 Raw File, 474 Script task, 567–568
clustering algorithms, 359 specifying for log providers, 502 copying SSIS packages to deploy,
ClusterProbability DMX function, types, 473–474 552–553
425 using in Script components, Count aggregate function, 147
CMMI (Capability Maturity Model 580–581 counters. See performance counters
Integration), 65 using within Visual Studio, Create A Custom Member Formula,
code access security (CAS), 648–649 473–474 Business Intelligence Wizard,
CodePlex Web site, 37–38, 86, 157 Connections property, 571, 580 244, 251
ConnectionString property, 581 CREATE KPI statement, 348

CreateNewOutputRows method, 582 CUBERANKEDMEMBER OLAP CUBESET OLAP function, 683


CRISP-DM life cycle model, function, 683 CUBESETCOUNT OLAP function, 683
399–400, 409 cubes, OLAP CUBEVALUE OLAP function, 683
cross validation, 417–418 adding aggregations, 263 currency conversions, configuring
CTP (community technology assessing source data quality, in Business Intelligence Wizard,
preview) version, SQL Server 516–518 244, 251–254
2008, 40 background, 13 CurrentMember MDX function, 232,
cube browser, 41–42, 201 as BI data structure, 13 313
cube designer BIDS browser, 41–42, 201 custom applications, integrating
accessing Business Intelligence building in BIDS, 198–204 SSIS packages in, 596–600
Wizard, 243 building prototypes, 50 custom foreach enumerators,
Actions tab, 201, 233–239 building sample using Adventure 594–595
Aggregations tab, 201, 262–263, Works, 37–39 custom member formulas, creating
275 configuring properties, 243–254 in Business Intelligence Wizard,
Browser tab, 41–42, 201 connecting to sample using 244, 251
Calculations tab, 201, 239–242, Microsoft Excel, 43–45 custom SSIS objects
334–335 as core of SQL Server BI projects, control flow tasks, 591–593
Cube Structure tab, 201, 201–203 115 data flow components, 588–591,
description, 201 core tools for development, 157 593–594
Dimension Usage tab, 126–128, creating empty structures, 133 deploying, 589–591
134–135, 211–212, 215 as data marts, 13 implementing user interfaces,
KPIs tab, 201, 228, 345 data vs. metadata, 258 593, 594
opening dimension editor, in data warehouses, 11 overview, 587–588
203–204 defined, 9 registering assemblies in GAC, 590
Partitions tab, 201, 264, 278 and denormalization concept, 125 signing assemblies, 589
Perspectives tab, 201, 227 vs. denormalized relational data customer relationship management
tool for building OLAP cubes, stores, 10 (CRM) projects, skills needed for
198–204 deploying, 254–255, 260–261 reporting, 75
Translations tab, 201 designing by using BIDS, 183–223 Cyclical content type, 362
cube partitions dimensions overview, 9–10,
defined, 263 204–210, 257–258
defining, 265–266 fact (measure) modeling, 146–147 D
enabling writeback, 285–286 first, building in BIDS, 218–223 Data Connection Wizard, 672
overview, 263–264 Microsoft Excel as client, 671–684 data dictionaries, 67
for relational data, 268–269 modeling logical design concepts, data flow designer
remote, 270 115–150 advanced edit mode, 484
specifying local vs. remote, 270 vs. OLTP data sources, 54 Calculated Columns sample
in star schema source tables, opening sample in Business package, 487
268–269 Intelligence Development Connection Manager window in,
storage modes, 270 Studio, 39–43 468
and updates, 532 overview of source data options, debugging Script components,
Cube Wizard 115–116 587
building first OLAP cube, 218–223 partitioning data, 263–270 defined, 468
Create An Empty Cube option, as pivot tables, 10–11 destination components, 485–486
199 pivoting in BIDS browser, 42 error handling, 497–498
Generate Tables In The Data presenting dimensional data, Execute Process sample, 482
Source option, 199, 200 138–142 overview, 482–483
launching from Solution Explorer, processing options, 287–291 paths, defined, 484
218 and ROI of BI solutions, 56–58 separate from control flow
populating Dimension Usage tab, skills needed for building, 72, 74 designer, 478
128 star schema source data models, source components, 483–485
Use Existing Tables option, 116–125 specifying Script components,
198–199, 200 UDM modeling, 9–10 573–581
CUBEKPIMEMBER OLAP function, updating, 530–533 and SSIS data viewer capability,
683 using dimensions, 210–217 488–489
CUBEMEMBER OLAP function, 683 viewing by using SSMS Object Toolbox window in, 469
CUBEMEMBERPROPERTY OLAP Browser, 164–168 transformation components,
function, 683 visualizing screen view, 10–11 486–488
data flow engine, in SSIS Distribution property, 363 downloading and installing,
asynchronous outputs, 459 DMX query language, 24, 179–180 47–48, 94
basic tasks, 453–454 end-user client applications, 431 installing, 687–688
memory buffers, 454 feature selection, 377–381 Table Analysis Tools group,
metadata characteristics, 454–458 future for client visualization 690–700
overview, 453–454 controls, 720 Data Mining Advanced Query
synchronous outputs, 458, 459 Generic Content Tree viewer, Editor, 704
variable-width column issue, 371–372 Data Mining Extensions. See DMX
456–457 getting started, 396 (Data Mining Extensions) query
data flow, in SSIS packages. See also implementing structures, 399–431 language
data flow engine, in SSIS importance of including Data Mining Model Training
asynchronous component functionality, 53–54 destination, 485, 534–535
outputs, 459 initial loading of structures and Data Mining Query control flow
custom components, 588–591, models, 533–534 task, 535–536
593–594 installing add-ins to Microsoft Data Mining Query data flow
error handling, 451–452 Office 2007, 687–688 component, 427, 428
logging events, 504 Microsoft Cluster viewer, 372 Data Mining Query Task Editor,
Lookup sample, 529–530 Microsoft Excel and end-user 427–428, 536
overview, 444 viewer, 356 Data Mining Query transformation
Script component, 567–568 Microsoft Office 2007 as client, component, 487, 536–537
synchronous component outputs, 687–721 data mining structure designer
458, 459 model viewers in BIDS, 46 choosing data mining model,
data flow maps, 525 Modeling Flags property, 363 365–368
Data Flow task, 476, 477–478, 482 object processing, 429–431 handling nested tables, 364, 366
data lineage, 524 OLAP cubes vs. relational source Mining Accuracy Chart tab, 360,
data marts, 13 data, 401 373–375, 417
data mining prediction queries, 419–426 Mining Model Prediction tab, 360,
adding data mining models to problem of too much data, 55 375, 419, 424
structures using BIDS, 404–406 processing models/objects, Mining Model Viewer tab, 46,
algorithms. See algorithms, data 407–409 360, 368–373, 408, 409
mining queries, 535–537 Mining Models tab, 360, 365–368,
ALTER MINING STRUCTURE Relationship property, 363 404, 404–405
syntax, 366 and ROI of BI solutions, 57–58 Mining Structure tab, 360,
Attribute Characteristics view, role of Microsoft Excel add-ins, 364–365, 404
378, 379 45–47 viewing source data, 364
Attribute Discrimination view, sample Adventure Works cube, Data Mining tab, Microsoft Excel
378, 380 357 Accuracy And Validation group,
Attribute Profiles view, 378 sample Adventure Works 712
background, 14 structure, 46 comparison with Table Tools
BIDS model visualizers, 46–47 skills needed for building Analyze tab, 700–701
BIDS visualizer for, 18 structures, 72, 74 Data Modeling group, 708–712
building objects, 407 software development life cycle Data Preparation group, 705–708
building prototype model, 50 for BI projects, 69 Management group, 701–702
building structures using BIDS, SQL Server Analysis Services 2008 Model Usage group, 702–705
401–404 enhancements, 148–149 Data Mining Wizard, 401–404,
cloud-hosted, 720–721 and SQL Server Integration 405–406
compared with OLAP cubes, 14, Services, 426–428 Data Profiling task
396 tools for object processing, defined, 510
content types, 361–363 429–431 limitations, 512
core tools for development, 157 validating models, 409–418 list of available profiles, 512–513
creating structures by opening viewing sample structures in new in SQL Server 2008, 478
new Analysis Services project in BIDS, 43 profiling multiple tables, 513
BIDS, 401–404 viewing structures by using SSMS viewing output file, 513
creating structures using SSAS, 18 viewers, 164, 168–170 when to use, 510
data types, 361–363 viewing structures using Microsoft data regions, SSRS, defined,
defined, 14 Excel, 47–50 638–655
Dependency Network view, Data Mining Add-ins for Office 2007 Tablix data region, defined,
370–371, 378 639–642

Data Source Properties dialog box, DatabaseIntegratedSecurity Descendants MDX function,


614–616 property, 668 318–319, 321
data source views (DSVs) data-first approach to BI design, 130 Description SSIS variable property,
compared with relational views, DataReader destination, 485 491
161 DataReader object, 598 destination components
creating, 199, 201 dataset designer, 618 data flow designer, 485–486
defined, 190 Dataset Properties dialog box, 618 Script-type, 586–587
examining, 190–192 date data type, 363 developers
getting started, 195 date functions, 321–326 IT administrators vs. traditional
making changes to existing tables, debugging developers, 81
193–194 SSIS packages, 471–472, 505–506 keeping role separate from tester
making changes to metadata, SSIS script tasks, 572 role, 81
192–193 using data viewers, 488–489, 506 manager’s role and responsibility
overview, 190–191 decision support systems, 13–14 on development teams, 79–81
as quick way to assess data decision tables responsibility for performing
quality, 518 fast load technique, 528 administrative tasks, 160, 181
required for building OLAP cubes, loading source data into, 525–526 SSAS graphical user interface for
199 Decision Tree view, Microsoft Visio, creating objects, 154
in SSIS, 466, 467 716 types needed for BI projects, 80
data storage containers, skills for Declarative Management development teams
building, 72 Framework (DMF) policies, 95 forming for BI projects, 76–83
data stores, OLAP. See also cubes, DefaultEvents class, 596 optional project skills, 74–76
OLAP Define Account Intelligence, required project skills, 72–74
denormalized, 8 Business Intelligence Wizard, role and responsibility of
as source for decision support 243, 246–247 developer manager, 79–81
systems, 13–14 Define Currency Conversion, role and responsibility of product
data stores, OLTP Business Intelligence Wizard, manager, 78
query challenges, 6–7 244, 251–254 role and responsibility of program
reasons for normalizing, 5–6 Define Dimension Intelligence, manager, 79
relational vs. non-relational, 5 Business Intelligence Wizard, role and responsibility of project
Data Transformation Services (DTS) 243, 250 architect, 78–79
comparison with SSIS, 446, Define Relationships dialog box, role and responsibility of release/
463–464 212–213, 215–216, 216 operations manager, 83
relationship to SSIS, 437–438 Define Semiadditive Behavior, role and responsibility of test
in SQL Server 2000, 546 Business Intelligence Wizard, manager, 81–82
data types 244, 250 role and responsibility of user
Boolean, 363 Define Time Intelligence, Business experience manager, 82–83
content types supported, 363 Intelligence Wizard, 243, 245 roles and responsibilities for
date, 363 degenerate dimension, in fact working with MSF, 76–83
defined, 361 tables, 119 source control considerations,
detecting in Data Mining Wizard, denormalization 111–113
403 and OLAP cube structure, 125 development tools, conducting
double, 363 in OLAP data stores, 8 baseline survey, 86
long, 363 Dependency Network view, deviation analysis, 360
text, 363 Microsoft Visio, 714–715 DimCustomer dimension table
data viewers, SSIS, 488–489, 506 Deploy option, BIDS Solution example, 122, 123
data visualization group, Microsoft Explorer, 260 DimCustomer snowflake schema
Research, 34, 83, 720 deploying example, 134
data vs. metadata, 258 code for custom objects, 589–591 dimension designer
data warehouses reports to SSRS, 623–624 accessing Business Intelligence
background, 12 role and responsibility of release/ Wizard, 243
compared with OLAP, 12 operations managers, 83 Attribute Relationships tab, 141
data marts as subset, 13 SSIS packages, 441, 461–462, Dimension Structure tab, 139
defined, 11 546–558 dimension editor
Microsoft internal, case study, Deployment Progress window, 41 Attribute Relationships tab, 205,
28 Deployment Utility, 556–558 207–209, 223
database snapshots, 507 Deployment Wizard, 155 Browser tab, 205
derived measures, 148 Dimension Structure tab, 205–207
event handler designer 753
opening from cube designer, as starting point for designing domain controllers, conducting
203–204 and building cubes, 204–205 baseline survey, 85
overview, 205 Unified Dimensional Model, 138 double data type, 363
Translations tab, 205, 209 writeback capability, 145 drillthrough actions, 233, 236–238,
dimension intelligence, configuring disconnected BIDS instances, 372–373
in Business Intelligence Wizard, 259–261 DROP KPI statement, 348
243, 250 Discrete content type, 361 .ds files, 108
Dimension Processing destination, Discretized content type, 362 .dsv files, 108
485 Distinct Count aggregate function, DSVs. See data source views (DSVs)
dimension structures, defined, 139 147 DTEXEC utility, 440
dimension tables Distribution property, 363 DTEXECUI utility, 440–441
data vs. metadata, 258 .dll files, 590 DTLoggedExec tool, 600
DimCustomer example, 122, 123 DMF (Declarative Management DTS. See Data Transformation
exploring source table data, 123 Framework) policies, 95 Services (DTS)
as first design step in OLAP DMX (Data Mining Extensions) Dts object, 571–572
modeling, 131 query language DtsConnection class, 597
generating new keys, 122, 146 adding query parameter to .dtsx files, 442, 540, 547
pivot table view, 123 designer, 634–635 DTUTIL utility, 441, 558
rapidly changing dimensions, 144 ALTER MINING STRUCTURE
slowly changing dimensions, syntax, 366
142–144 background, 24 E
space issue, 124 building prediction queries, Enable Writeback dialog box, 286
for star schema, 117–118, 419–421 end users
121–125, 194 Cluster function, 425 decision support systems for
table view, 123 ClusterDistance function, 425 communities, 13–14
types of columns, 121–122 ClusterProbability function, 425 reporting interface
updating, 532–533 defined, 24 considerations, 51
Dimension Usage tab, in cube designer, defined, 627 viewing BI from their perspective,
designer designer, overview, 617 31–50
options, 211–212 including query execution in SSIS Enterprise Manager. See SQL Server
for snowflake schema, 135, 215 packages, 535–537 Enterprise Manager
for star schema, 126–127, 134–135 Predict function, 421, 423–424, enumerators, custom, 588, 594–595
using Cube Wizard to populate, 425 envisioning phase, MSF, 65–67
128 PredictHistogram function, 425 Error and Usage Reporting tool, 155
dimensions prediction functions, 423–426 error conditions, in BIDS, 259
adding attributes, 222–223 PREDICTION JOIN syntax, 422, error handling, in SSIS, 451–452,
adding to OLAP cube design 427 497–499
using Cube Wizard, 220–221 prediction query overview, ETL (extract, transform, and load)
combining with measures to build 421–423 systems
OLAP cubes, 210–214 PredictProbability function, 425 background, 14–15
configuring properties, 243–254 PredictProbabilityStDev function, as BI tool, 515–516
creating using New Dimension 425 defined, 14–15
Wizard, 221 PredictProbabilityVar function, importance of SSIS, 52–53
data vs. metadata, 258 425 for loading data mining models,
enabling writeback for partitions, PredictStDev function, 425 533–537
285–286 PredictSupport function, 425 for loading OLAP cubes, 516–530
hierarchy building, 138–139 PredictTimeSeries function, 425 security considerations, 97–98
non-star designs, 215–217 PredictVariance function, 425 skills needed, 73, 76
presenting in OLAP cubes, queries in SSIS, 426–428 SSIS as platform for, 435–462
138–142 querying Targeted Mailing for updating OLAP cubes,
processing options, 287–291 structure, 633–635 530–533
querying of properties, 329–332 RangeMax function, 425 EvaluateAsExpression SSIS variable
rapidly changing, 144, 284 RangeMid function, 425 property, 491
relationship to cubes, 257–258 RangeMin function, 425 event handler designer
role in simplifying OLAP cube switching from MDX designer to Connection Manager window in,
complexity, 204–205 DMX designer, 633 468
slowly changing, 142–144 templates, 179–180 defined, 468
ways to implement queries, 426 Toolbox window in, 469
754 event handling, in SSIS

event handling, in SSIS, 450–451, extending programmatically, FactResellerSales fact table example,
499–501 732–736 119
events, logging, 501–505 immutability of Excel sheets, 726 fast load technique
Excel. See also Excel Services overview, 724 for loading initial data into
adding Data Mining tab to publishing parameterized Excel dimension tables, 528
Ribbon, 47–48, 50 sheets, 729–732 for loading initial data into fact
adding Table Tools Analyze tab to sample worksheets, 726–729 tables, 527
Ribbon, 47–48, 419 and Web Services API, 732–736 for updating fact tables, 532
Associate sample workbook, ExclusionGroup property, 579 File deployment, for SSIS packages,
48–49 Execute method, 592, 596, 597 547
as client for SSAS data mining Execute Package task, 478 file servers, conducting baseline
structures, 101 Execute Process sample survey, 85
as client for SSAS OLAP cubes, control flow tasks in, 476–478 files-only installation, SSRS, 607
100–101 data flow designer, for SSIS Filter MDX function, 305–307, 308
configuring session-specific package, 482–483 filtering
connection information, 101 installing, 474–475 creating filters on datasets, 637
connecting to sample SSAS OLAP Execute SQL tasks, 476, 476–477, in data mining models, 366
cubes, 43–45 494 source data for data mining
creating sample PivotChart, ExecuteReader method, 597, 598 models, 404–405
679–680 Explore Data Wizard, 705–706 firewalls
creating sample PivotTable, Expression And Configuration conducting baseline survey, 85
678–679 Highlighter, 494 security breaches inside, 97
Data Connection Wizard, 672 Expression dialog box, 637 First (Last) NonEmpty aggregate
Data Mining add-ins, 73, 361, Expression SSIS variable property, function, 147
368, 419 491–492 FirstChild aggregate function, 147
data mining integration expressions Flat File connection manager, 474
functionality, 689–690 adding to constraints, 480–481 Flat File data flow destination, 485
Data Mining tab functionality, in dataset filters, 637 Flat File data flow source, 483
700–712 in SSIS, 447, 493–494 For Loop containers, SSIS, 478
Dependency Network view for Expressions List window, 494 foreach enumerators, custom,
Microsoft Association, 48–49 extracting. See ETL (extract, 594–595
extending, 683–684 transform, and load) systems Foreach Loop containers, 476, 478
Import Data dialog box, 674 extraction history, 524 ForEachEnumerator class, 594
Offline OLAP Wizard, 681–683 forecasting algorithms, 359
as OLAP cube client, 671–684 FREEZE keyword, 342
OLAP functions, 683 F functions, 299–307, 326
as optional component for BI fact columns, in fact tables, Fuzzy Grouping transformation,
solutions, 21 118–119, 146 517–518
PivotTable interface, 675–677 fact dimension (schema), 216 Fuzzy Lookup transformation, 521
PivotTables as interface for OLAP fact modeling, 146–147
cubes, 10–11
popularity as BI client tool,
fact rows, in fact tables, 261
fact tables G
723–724 in Adventure Works cube, 211 Gauge control, 622, 638
Prediction Calculator, 419 data vs. metadata, 258 GetEnumerator method, 594
role in understanding data degenerate dimension, 119 global assembly cache (GAC),
mining, 45–47 exploring source table data, 120, registering assembly files, 590
security for SSAS objects, 100–101 146–147 grain statements, 128–129
skills needed for reporting, 73, 75 FactResellerSales example, 119 granularity, defined, 128
trace utility, 101 fast load technique, 527, 528 GUI (graphic user interface), SQL
viewing data mining structures, loading initial data into, 527–530 Server 2008
47–50 multiple-source, 211 need for developers to master,
viewing SSRS reports in, 649–650 OLAP model design example, 69–70
Excel Calculation Services, 724 131–132 for SSAS developers, 154
Excel data flow destination, 485 pivot table view, 120
Excel data flow source, 483 for star schema, 117–118, 118–121
Excel Services. See also Excel storage space concern, 121
basic architecture, 724–725 types of columns, 118
complex example, 733–736 updating, 532
MDX 755

H templates for, 229–232


viewing in Adventure Works cube,
M
Head MDX function, 316 228, 229 Maintenance Plan Tasks, SSIS, 471
health care industry, business Key Sequence content type, 362, many-to-many dimension (schema),
intelligence solutions, 27–28 363 216–217
hierarchical MDX functions, 316–320 Key Time content type, 362, 363 Matrix data region, 638–655
HOLAP (hybrid OLAP), 267, 279 Kimball, Ralph, 12 Max aggregate function, 147
holdout test sets, 403 KPIs. See key performance MDX (Multidimensional Expressions)
HTTP listener, 607, 607–609 indicators (KPIs) query language
Ancestors function, 319–320
background, 23
I L BottomCount statement, 311
IDBCommand interface, 597 Children function, 300, 316, 321
LastChild aggregate function, 147
IDBConnection interface, 597 core syntax, 296–305
LastChild MDX function, 314, 324
IDBDataParameter interface, 597 creating calculated members, 307
LastPeriods MDX function, 314, 324,
IIf MDX function, 337–338 creating named sets, 308,
325
IIS (Internet Information Services) 338–340
law enforcement, business
conducting baseline survey, 86 creating objects by using scripts,
intelligence solutions, 29
not an SSRS requirement, 606, 309
least-privilege security
608, 610 creating permanent calculated
accessing source data by using,
noting version differences, 86 members, 333–336
96–97, 190
Image data region, 638–655 in cube designer Calculations tab,
configuring logon accounts, 98
Import Data dialog box, 674 240, 241–242
when to use, 70
Import/Export Wizard CurrentMember function, 232, 313
life cycle. See software development
defined, 155 date functions, 321–326
life cycle
role in developing SSIS packages, defined, 23
lift charts, 410–413
439 Descendants function, 318–319,
linked objects, 285
Inmon, Bill, 12 321
list report item, 638–655
Input0_ProcessInputRow method, designer included in Report
load balancing, 270
577, 578 Builder, 644
Load method, 596
Integration Services. See SQL Server Filter function, 305–307, 308
loading. See ETL (extract, transform,
Integration Services (SSIS) functions, 299–307, 326
and load) systems
IntelliSense, 633, 637 Head function, 316
local processing mode, 653, 657
Internet Information Services hierarchical functions, 316–320
localization, 29
conducting baseline survey, 86 IIf function, 337–338
Log Events window, 503
not an SSRS requirement, 606, Internet Sales example, 295,
log locations, 502–503
608, 610 297–299
log providers
noting version differences, 86 and key performance indicators,
custom, 588, 594
iteration 232
overview, 459–460
in BI projects, 62 LastChild function, 314, 324
specifying connection manager,
in OLAP modeling, 132 LastPeriods function, 314, 324, 325
502
Members function, 299–300, 308
logging
native vs. generated, 225
for package execution, 501–505
K question of how much, 504–505
object names, 296
opening query designer, 628
key columns, in fact tables, 118 SSIS log providers, 459–460, 502
OpeningPeriod function, 322–323
Key content type, 362 viewing results, 503
operators, 297
key performance indicators (KPIs) logical modeling, OLAP design
Order function, 302–303
accessing from KPIs tab in cube concepts, 115–150
ParallelPeriod function, 232, 322
designer, 201, 228, 345 logical servers and services
Parent function, 317–318
client-based vs. server-based, 232 conducting baseline survey, 86
PeriodsToDate function, 333
core metrics, 229 considerations, 92–94
query basics, 295
creating, 345–349 service baseline considerations, 94
query designer, 617
customizing, 231–232 long data type, 363
query designer, defined, 627
defined, 15 Lookup data flow transformation,
query designer, overview,
defining important metrics, 55 486
628–631
metadata browser for, 229–231 Lookup sample, using SSIS to load
query templates, 175–178
nesting, 229 dimension and fact tables,
querying dimension properties,
overview, 149, 228–233 528–530
329–332
756 MDX IntelliSense

MDX, continued Microsoft Clustering algorithm, deploying phase, 71


Rank function, 312–314 386–389, 400 development team roles and
SCOPE keyword, 246 Microsoft Decision Trees algorithm responsibilities, 76–83
scripts, 341–343 in classification matrix example, envisioning phase, 65–67
setting parameters in queries, 415 milestones, 62
631–633 defined, 400 planning phase, 67–68
Siblings function, 317 in lift chart example, 412–413 project phases, 62, 65–71
Tail function, 315, 330–331 overview, 381–383 role of iteration, 62
TopCount function, 310 in profit chart example, 414 spiral method, 62, 64
Union function, 320 for quick assessment of source stabilizing phase, 70–71
using with PerformacePoint data quality, 518 Microsoft SQL Server 2008. See
Server 2007, 352–354 viewers for, 369–370 SQL Server 2008
using with SQL Server Reporting Microsoft Distributed Transaction Microsoft SQL Server 2008,
Services (SSRS), 349–351 Coordinator (MS-DTC), 508 Enterprise edition. See
warning about deceptive Microsoft Dynamics, 22, 75 SQL Server 2008, Enterprise
simplicity, 239 Microsoft Excel. See Excel edition
working in designer manual Microsoft Excel Services. See Excel Microsoft Time Series algorithm,
(query) mode, 629, 630–631 Services 383–386, 400
working in designer visual (design) Microsoft Linear Regression Microsoft Visio. See Visio
mode, 629–631 algorithm, 383, 400 Microsoft Visual Studio. See Visual
Ytd function, 294 Microsoft Logistic Regression Studio
MDX IntelliSense, 633 algorithm, 395–396, 400 Microsoft Visual Studio Team
measure columns, in fact tables, Microsoft Naïve Bayes algorithm, System (VSTS)
118–119 376–381, 399, 518 integrating MSF Agile into, 64–65
Measure Group Bindings dialog box, Microsoft Neural Network reasons to consider, 22, 546
213–214 algorithm, 394–395, 400 Team Foundation Server, 111–112
Measure Group Storage Settings Microsoft Office 2007 Microsoft Visual Studio Tools for
dialog box, 278–279 as data mining client, 687–721 Applications (VSTA)
measure groups installing Data Mining Add-ins, debugging scripts, 572
in Adventure Works cube, 211 687–688 defined, 568
creating measures, 211 optional components for BI writing Script component code,
defined, 211 solutions, 21–22 577–582
defining relationship to dimension Microsoft Office SharePoint Server writing scripts, 570–572
data, 212–213 2007. See Office SharePoint Microsoft Visual Studio Tools for the
enabling writeback for partitions, Server 2007 Microsoft Office System (VSTO),
286 Microsoft PerformancePoint Server 683–684
how they work, 211 (PPS) Microsoft Word
relationship to source fact tables, integration with SQL Server as optional component for BI
146–147 Analysis Services, 745 solutions, 21
selecting for OLAP cubes, 219 as optional component for BI viewing SSRS reports in, 649–650
measure modeling, 146–147 solutions, 22 milestones, in Microsoft Solutions
measures, calculated compared with skills needed for reporting, 75 Framework (MSF), 62
derived, 148 using MDX with, 352–354 Min aggregate function, 147
Members MDX function, 299–300, Microsoft Project Server, 22 mining model viewers, 46, 48–49,
308 Microsoft Research, 34, 83, 478, 720 356, 368, 408
memory management, and SSRS, Microsoft Security Assessment Tool, mining structure designer
665–666 95 choosing data mining model,
metadata, data flow Microsoft Sequence Clustering 365–368
characteristics, 454–458 algorithm, 389–390, 400 handling nested tables, 364, 366
how SSIS uses, 458 Microsoft Solutions Framework Mining Accuracy Chart tab, 360,
variable-width column issue, (MSF). See also MSF (Microsoft 373–375, 417
456–457 Solutions Framework) for Agile Mining Model Prediction tab, 360,
metadata vs. data, 258 Software Development 375, 419, 424
Microsoft Association algorithm, Agile Software Development Mining Model Viewer tab, 46,
391–393, 400 version, 63–65 360, 368–373, 408, 409
Microsoft Baseline Security alternatives to, 62 Mining Models tab, 360, 365–368,
Analyzer, 95 building phase, 68–70 404, 404–405
Microsoft Biztalk Server, 22 defined, 62
OLAP cubes 757
Mining Structure tab, 360, developer skills, 80, 81 compared with data warehousing,
364–365, 404 skills needed for custom client 12
viewing source data, 364 reporting, 75 defined, 8
Model Designer, SSRS, backing up for SQL Server Integration Microsoft Excel functions, 683
files, 108 Services, 442 modeled as denormalized, 8
model training. See Data Mining using code in SSRS reports, when to use, 8
Model Training destination 647–649 working offline in Microsoft Excel,
modeling using to develop custom SSIS 681–683
logical modeling, OLAP design objects, 587–588, 588 OLAP cubes
concepts, 115–150 network interface cards (NICs), adding aggregations, 263
OLAP. See OLAP modeling conducting baseline survey, 85 assessing source data quality,
OLTP modeling, 115, 137–138 New Cube Wizard, 134 516–518
physical, for business intelligence New Table Or Matrix Wizard, background, 13
solutions, 4 644–646 as BI data structure, 13
Modeling Flags property, 363 non-relational data, defined, 5 BIDS browser, 41–42, 201
MOLAP (multidimensional OLAP), normalization building in BIDS, 198–204
267, 279 implementing in relational data building prototypes, 50
MS-DTC (Microsoft Distributed stores, 5 building sample using Adventure
Transaction Coordinator), 508 reasons for using, 6–7 Works, 37–39
MsDtsSrvr.ini.xml, SSIS configuration view of OLTP database example, 5 configuring properties, 243–254
file, 112 connecting to sample using
MSF (Microsoft Solutions Microsoft Excel, 43–45
Framework) for Agile Software O as core of SQL Server BI projects,
Development Object Explorer 115
background, 63–65 defined, 39 core tools for development, 157
built into Microsoft Visual Studio viewing SSAS objects from SSMS, creating empty structures, 133
Team System, 64–65 160 as data marts, 13
defined, 63 viewing SSIS objects from SSMS, data vs. metadata, 258
development team roles and 438, 439 in data warehouses, 11
responsibilities, 77 object viewers, 164 defined, 9
project phases, 65–71 ODBC connection manager, 473 and denormalization concept, 125
suitability for BI projects, 64 Office 2007 vs. denormalized relational data
MSF (Microsoft Solutions as data mining client, 687–721 stores, 10
Framework) for Capability installing Data Mining Add-ins, deploying, 254–255, 260–261
Maturity Model Integration 687–688 designing by using BIDS, 183–223
(CMMI), 65 optional components for BI dimensions overview, 9–10,
.msi files, 38 solutions, 21–22 204–210, 257–258
MSReportServer_ Office SharePoint Server 2007 fact (measure) modeling, 146–147
ConfigurationSetting class, 668 configuration modes for working first, building in BIDS, 218–223
MSReportServer_Instance class, 668 with SSRS, 737–738 Microsoft Excel as client, 671–684
multidimensional data stores. See integrated mode, installing SSRS modeling logical design concepts,
cubes, OLAP add-in for, 742–743 115–150
Multidimensional Expressions. native mode, integration of SSRS vs. OLTP data sources, 54
See MDX (Multidimensional with, 740–741 opening sample in Business
Expressions) query language as optional component for BI Intelligence Development
solutions, 21–22 Studio, 39–43
overview of source data options,
N Report Center, 744–745
skills needed for reporting, 75 115–116
Name SSIS variable property, 492 SQL Server business intelligence partitioning data, 263–270
named sets, 241, 308, 338–340 and, 723–745 as pivot tables, 10–11
Namespace SSIS variable property, SSRS and, 604, 736–745 pivoting in BIDS browser, 42
492 template pages, 744–745 presenting dimensional data,
natural language, 67 Windows SharePoint Services, 94 138–142
.NET API Offline OLAP Wizard, 681–683 processing options, 287–291
application comparison with SSIS OLAP (online analytical processing) and ROI of BI solutions, 56–58
projects in Visual Studio, 467 characteristics, 8 skills needed for building, 72, 74
and compiled assemblies, compared with data mining, 14 star schema source data models,
196–197 116–125
758 OLAP modeling

OLAP cubes, continued Ordered content type, 362 developing in Visual Studio,
UDM modeling, 9–10 Outliers Wizard, 706–707 464–472
updating, 530–533 overtraining, data model, 535 documentation standards, 525
using dimensions, 210–217 Encrypt Sensitive With Password
viewing by using SSMS Object encryption option, 563
Browser, 164–168 P Encrypt Sensitive With User Key
visualizing screen view, 10–11 Package Configuration Wizard, encryption option, 563
OLAP modeling 549–550 encrypting, 554–556
compared with OLTP modeling, Package Configurations Organizer encryption issues, 563
115 dialog box, 548–549, 551–553 error handling, 451–452
compared with views against package designer event handling, 450–451, 499–501
OLTP sources, 137–138 adding connection managers to executing by using DTEXEC utility,
as iterative process, 132 packages, 473 440
naming conventions, 150 best design practices, 509–510 executing by using DTEXECUI
naming objects, 132 Connection Manager window in, utility, 440–441
need for source control, 132 468 expressions, 447
role of grain statements, 128–129 control flow designer, 468 external configuration file,
tools for creating models, data flow designer, 468 548–552
130–132, 149–150 debugging packages, 471–472 file copy deployment, 552–553
using Visio 2007 to create models, event handler designer, 468 handling sensitive data, 563
130–132, 133 executing packages, 471–472 keeping simple, 508, 509
OLE DB connection manager, 473 how they work, 470–472 logical components, 442–452
OLE DB data flow destination, 485 navigating, 479 physical components, 442
OLE DB data flow source, 483, 484 overview, 467–469 role of SSMS in handling, 438–439
OLTP (online transactional Toolbox window in, 469 saving results of Import/Export
processing) viewing large packages, 479 Wizard as, 439
characteristics, 6 Package Explorer, 468–469 scheduling execution, 558–559
defined, 5 Package Installation Wizard, security considerations, 97–98,
normalizing data stores, 5 557–558 559–562
querying data stores, 6–7 Package Store File deployment, for setting breakpoints in, 505–506
OLTP modeling SSIS packages, 547 source control considerations, 112
compared with OLAP modeling, Package Store MSDB deployment, as SSIS principal unit of work,
115 for SSIS packages, 547 436, 438
compared with OLAP views, packages, in SSIS and SSIS runtime, 452
137–138 adding checkpoints to, 506–507 tool and utilities for, 438–441
OLTP table partitioning, 268–269 adding to custom applications, upgrading from earlier versions of
OnError event, 451 596–600 SQL Server, 440
OnExecStatusChanged event backups and restores, 107–108 variables, 445–447
handler, 500 best practices for designing, where to store, 98, 112
OnInit method, 648 509–510 PacMan (SSIS Package Manager),
online analytical processing. configurations, 461 600
See OLAP (online analytical configuring transactions in, parallel processing, in SQL Server
processing) 507–508 2008, Enterprise edition, 269
online transactional processing. connection managers, 448–450 ParallelPeriod MDX function, 232,
See OLTP (online transactional control flow, 442–444 322
processing) control flow compared with data Parent MDX function, 317–318
OnPostExecute event handler, 500, flow, 444–445 parent tables, relationship to child
500–501 creating with BIDS, 463–495 tables, 5
OnProgress event handler, 500 data flow, 444 Partition Processing destination, 485
OnVariableValueChanged event debugging, 505–506 Partition Wizard, 265–266
handler, 500 default error handling behavior, partitions, cube
OpeningPeriod MDX function, 498 defined, 263
322–323 defined, 20 defining, 265–266
operating environment, conducting deploying and managing by using enabling writeback, 285–286
baseline survey, 86, 87–88 DTUTIL utility, 441, 558 overview, 263–264
optional skills, for BI projects, 74–76 deployment options, 461–462, for relational data, 268–269
Oracle, 5, 19 546–558 remote, 270
Order MDX function, 302–303 specifying local vs. remote, 270
query designer 759
in star schema source tables, target location options for processing layer, security
268–269 deploying SSIS packages, considerations, 97–98
storage modes, 270 547–548 processing time, 270
and updates, 532 typical initial BI installation, 90 ProcessInput method, 583, 584–585,
partitions, table, 268–269 pie chart, adding to Microsoft Excel 585, 586
Pasumansky, Mosha, 340, 633 sheet, 729–731 ProcessInputRow method, 584,
performance counters PivotCharts, Microsoft Excel 585, 586, 587. See also Input0_
possible problems to document, adding to workbooks, 679–680 ProcessInputRow method
88 creating views, 44 processors, multiple, 454, 523
role in creating baseline PivotTable Field List, 650–651, 652 ProClarity, 22
assessment of operating PivotTables, Microsoft Excel product managers
environment, 87–88 connecting to sample cubes, job duties, 78
Performance Visualization tool, 510 43–44 role and responsibility on
PerformancePoint Server creating, 678–679 development teams, 78
integration with SQL Server creating PivotChart views, 44 profit charts, 413–414
Analysis Services, 745 dimensional information, 678–679 program managers
as optional component for BI formatting, 679 job duties, 79
solutions, 22 as interface for cubes, 10–11 role and responsibility on
skills needed for reporting, 75 overview, 675–680 development teams, 79
using MDX with, 352–354 ways to pivot view of data, 44 project architects, role and
PeriodsToDate MDX function, 333 planning phase, MSF, 67–68 responsibility on development
permissions, for SSRS objects, PostExecute method, 577, 578, 587 teams, 78–79
103–104 PostLogMessage method, 580 Project Real, 28
perspectives PPS. See Microsoft Project Server, 22
compared with relational views, PerformancePoint Server (PPS) Propagate variable, 501
227 precedence constraints, 480–481 prototypes, building during MSF
defined, 149 Predict DMX function, 421, 423–424, planning phase, 68
overview, 227–228 425, 427 proxy accounts, 563
phases, in Microsoft Solutions PredictHistogram DMX function, 425 proxy servers, conducting baseline
Framework (MSF), 62 prediction algorithms, 359 survey, 85
physical infrastructure Prediction Calculator, Microsoft
assessing servers needed Excel, 419
for initial BI development prediction functions, 423–426 Q
environment, 88 PREDICTION JOIN syntax, 422, 427 queries. See also DMX (Data Mining
conducting baseline survey, 85–87 predictive analytics, 148–149, 355, Extensions) query language;
planning for change, 85–89 366, 426 MDX (Multidimensional
physical modeling, for business Predictive Model Markup Language Expressions) query language
intelligence solutions, 4 (PMML), 409 cache scopes, 326
physical servers PredictProbability DMX function, challenges in OLTP data stores,
assessing number needed 425 6–7
for initial BI development PredictProbabilityStDev DMX creating in report designer,
environment, 88–89 function, 425 616–618
for business intelligence solutions, PredictProbabilityVar DMX function, creating named sets, 338–340
4 425 creating permanent calculated
conducting baseline survey, 85 PredictStDev DMX function, 425 members, 333–336
considerations, 91–92 PredictSupport DMX function, 425 manually writing, 54–55
consolidation, 92 PredictTimeSeries DMX function, MDX basics, 295
determining optimal number 425 optimizing, 326
and placement for initial BI PredictVariance DMX function, 425 query browsers
development environment, PreExecute method, 577, 587 Cube filter, 175
89–94 proactive caching Measure Group filter, 175
development server vs. test fine tuning, 283–284 query designer
server, 91 notification settings, 282 multiple types, 616–618
installing SQL Server, 90 overview, 279–282 for reports, 627–638
installing SSAS, 90 Process Cube dialog box, 288–289 setting parameters in queries,
installing SSRS, 90 Process Progress dialog box, 407, 631–633
429–430
760 query languages

query languages, 23–25. See also role and responsibility on reporting-type actions, 233,
DMX (Data Mining Extensions) development teams, 83 235–236
query language; MDX remote partitions, 270 reports. See also report processing
(Multidimensional Expressions) remote processing mode, 653, systems; SQL Server Reporting
query language; XMLA (XML 656–657 Services (SSRS)
for Analysis) query language Report Builder adding custom code to, 647–651
Query Parameters dialog box, 632 creating report, 644–646 authentication credential flow in
query templates defined, 19, 156 SSRS, 103
for DMX (Data Mining Extensions) user interface, 643 building for SSRS, 627–646
query language, 179–180 version issues, 604, 643 cleaning and validating data for,
execution process, 174–175 Report Data window, 618 55
for MDX (Multidimensional Report Definition Language (RDL) client interface considerations, 51
Expressions) query language, creating metadata, 635 considering end-user audiences,
175–178 defined, 24–25 51
for XMLA (XML for Analysis) query role in report building, 621, 623, creating by using New Table Or
language, 180 624 Matrix Wizard, 644–646
query tuning, 276 version issues, 624 creating in SSRS, 603–624
report designer creating with BIDS, 612–622
adding report items to designer defining data sources, 613–614
R surface, 638–639 defining project-specific
RaiseChangedEvent SSIS variable backing up files, 108 connections, 627
property, 492 building tabular report, 619–620 deploying to SSRS, 623–624
RangeMax DMX function, 425 creating queries, 616–618 query designers for, 627–638
RangeMid DMX function, 425 fixing report errors, 621 samples available, 622
RangeMin DMX function, 425 illustrated, 614 setting parameters in queries,
Rank MDX function, 312–314 opening, 614 631–633
rapidly changing dimensions, 144 previewing reports, 620 Toolbox for, 621–622, 638
Raw File connection manager, 474 tabular reports, 619–620 using Tablix data region to build,
Raw File data flow destination, 485, types of report data containers, 640–642
507 618 viewing in Microsoft Excel, 650
Raw File data flow source, 483, using Tablix data region to build viewing in Word, 649–650, 650
484–485 reports, 640–642 Reports Web site, 604
RDBMS (relational database version enhancements, 19, 614 ReportViewer
management systems) working with MDX query results, control features, 652–656
conducting baseline survey of 635–643 embedding custom controls,
servers, 85 Report Manager Web site, 604, 609 652–656
defined, 5 report models, 660–662 entering parameter values,
SQL Server data as source data, 19 report processing modes, 653 656–657
RDL. See Report Definition report processing systems, 15. See security credentials, 657–658
Language (RDL) also SQL Server Reporting required skills, for BI projects, 72–74
.rdl files, 108, 112–113 Services (SSRS) restores. See backups and restores
.rds files, 108 Report Project Property Pages ROLAP (relational OLAP)
ReadOnlyVariables property, 577 dialog box, 623–624 dimensional data, 284–285
ReadWriteVariables property, 578 Report Properties dialog box, in Measure Group Storage
Recordset destination, 485 647–648 Settings dialog box, 279
rectangle report item, 639–655 Report Server Web service overview, 267–268
regression algorithms, 359 authentication for access, roles, in SSAS, 195–196
Re-label Wizard, 706, 707 610–611 .rptproj files, 108
relational data authoring and deploying reports, rsconfig.exe tool, 604
defined, 5 738–740 rs.exe tool, 609
normalizing, 5 defined, 603–604 Rsmgrpolicy.config file, 108
partitioning, 268–269 job manager, 612 Rsreportserver.config file, 108
tables for denormalizing, 8 overview, 609–610 Rssvrpolicy.config file, 108
Relationship property, 363 Reporting Services. See SQL Server runtime, SSIS, 452
ReleaseConnections method, 581 Reporting Services (SSRS)
release/operations managers Reportingservicesservice.exe.config
job duties, 83 file, 108
SQL Server 2008 761

S custom client considerations,


104–106
Solution Explorer, in BIDS, 40, 46,
184, 186–188
Sample Data Wizard, 705, 707–708 encrypting packages when Solution Explorer, Visual Studio
Save Copy Of Package dialog box, deploying, 554–556 configuring SSAS object
554–556 handling sensitive SSIS package properties, 243
scalability, and SSRS, 662–664 data, 563 data sources and data source
scaling out, in SQL Server 2008, overview of SSIS package issues, views, 466, 467
Enterprise edition, 666–667 559–562 SSIS Packages folder, 466–467
schema-first approach to BI design, passing report credentials viewing SSIS projects in, 466–467
130 through ReportViewer, 657–658 solutions, defined, 98, 539
SCOPE keyword, MDX query proxy execution accounts for SSIS SOLVE_ORDER keyword, 343–344
language, 246, 341–342 packages, 79 source code control/source control
Scope SSIS variable property, 492 Security Assessment Tool, 95 systems, 540–542
Script component. See also Script security requirements source control, 111–113, 132
Transformation Editor dialog in development environment, 70 source data
box overview, 95–106 accessing by using least-
compared with Script task, Select Script Component type privileged accounts, 96–97
567–568 dialog box, 573–574 cleaning, validating, and
connection managers in, 580–581 semiadditive behavior, configuring consolidating, 69
as data source, 573 in Business Intelligence Wizard, collecting connection information,
debugging, 587 244, 250 96–97
destination-type, 586–587 sequence analysis algorithms, 359 loading into decision tables,
selecting Transformation type Sequence containers, 478 525–526
option, 574 servers. See logical servers and non-relational, 5
source-type, 582 services; physical servers performing quality checks before
synchronous and asynchronous service level agreements (SLAs) loading mining structures and
transformation, 582–586 availability strategies, 87 models, 533–534
type options, 573–574 conducting baseline survey, 87 querying OLTP data stores, 6–7
writing code, 577–581 reasons to create in BI projects, 87 relational, 5
Script task Service Principal Name (SPN), 159 structure names, 68
compared with Script component, Shared Data Source Properties transformation issues, 519–523
567–568 dialog box, 613–614 source data systems, upgrading to
defined, 478 SharePoint Server. See Office SQL Server 2008, 89
using to define scripts, 568–570 SharePoint Server 2007 Specify A Unary Operator, Business
Script Task Editor dialog box, Siblings MDX function, 317 Intelligence Wizard, 244,
568–570 signing assemblies, 589 248–250
Script Transformation Editor dialog skills, for BI projects Specify Attribute Ordering, Business
box optional, 74–76 Intelligence Wizard, 244, 250
Connection Managers page, 580 required, 72–74 spiral method, 62, 64
Inputs And Outputs page, 576, .sln files, 98 SQL Server 2000, upgrading
578–579, 579 backing up, 108 production servers for SSAS,
Input Columns page, 574–576 Slowly Changing Dimension SSIS, and SSRS, 90
opening, 574 transformation, 533 SQL Server 2008
scripting Slowly Changing Dimension Wizard, command-line tools installed, 157
limitations, 587 SSIS, 143–144 complete installation, 156–157
Script task compared with Script slowly changing dimensions (SCD), as core component of Microsoft
component, 567–568 142–144 BI solutions, 16, 19
for SSRS administrative tasks, .smdl files, 108, 112–113 customizing data display, 3
667–669 snapshots, database, 507 Database Engine Tuning
ScriptLanguage property, 568 snowflake schema Advisor, 8
scripts, MDX, 341–343 DimCustomer example, 134 documenting sample use, 86
security. See also least-privilege on Dimension Usage tab of cube downloading and installing Data
security designer, 135, 215 Mining Add-ins for Office 2007,
best practices, 70, 564 overview, 134 47–48
BIDS, for solutions, 98–99 when to use, 136–137 downloading and installing
BIDS, when creating SSIS software development life cycle, sample databases, 154
packages, 98 61–71 feature differences by edition,
37, 58
762 SQL Server 2008, Enterprise edition

SQL Server 2008, continued providers for star schema source considering where to install, 89
installing sample databases, 37–41 data, 117 creating ETL packages, 55
installing samples in development query designers, using for in custom applications, 596–600
environment, 86 creating reports, 627–638 custom task and component
minimum-installation paradigm, querying objects in SSMS, development, 587–595
153–154 170–175 data flow engine, defined, 436
new auditing features, 111 reasons for installing SQL Server data flow engine, overview,
online transactional processing, Database Engine Services with, 453–459
5–8 153 and data mining by DMX query,
security features, 70 as requirement for OLAP BI 426–428
SQL Server 2008, Enterprise edition solutions, 23 data mining object processing,
parallel processing, 269 roles in, 195–196 430
scaling out, 666–667 scaling to multiple machines, 91 defined, 20
SQL Server Agent, 558–564 security considerations, 98–99 documenting service logon
SQL Server Analysis Services source control considerations, account information, 94
(SSAS). See also SQL Server 112, 113 error handling, 497–499
Management Studio (SSMS); SSMS as administrative interface, ETL skills needed, 76
SQL Server Profiler 16 event handling, 499–501
aggregation types, 147 using compiled assemblies with history, 437–438
background, 16–17 objects, 196–197 as key component in Microsoft BI
backups and restores, 106–107 using OLAP cubes vs. OLTP data solutions, 20
baseline service configuration, sources, 54 log providers, 459–460
157–159 viewing configuration options in mastering GUI, 69–70
BIDS as development interface, 16 SSMS, 93 MsDtsSrvr.ini.xml configuration
BIDS as tool for developing cubes, viewing SSAS objects in Object file, 112
16 Explorer, 160 .NET API overview, 442
building sample OLAP cube, viewing what is installed, 153–154 object model and components,
37–39 working on databases in BIDS in 442–452
considering where to install, 89 connected mode, 261 object model, defined, 436
as core component of Microsoft SQL Server Books Online, defined, package as principal unit of work,
BI solutions, 16 156 436, 438
core tools, 153–181 SQL Server Compact destination, performance counters for, 87
creating data mining structures, 485 relationship to Data
18 SQL Server Configuration Manager, Transformation Services, 437
creating roles, 195–196 94, 155, 157–158 runtime, defined, 436
credentials and impersonation SQL Server Database Engine, 108 runtime, overview, 452
information for connecting to SQL Server Database Engine Tuning scaling to multiple machines, 91
data source objects, 98–99 Advisor, 8, 156 scripting support, 567–587
Cube Wizard, 128 SQL Server destination, 485 security considerations for
data source overview, 188–190 SQL Server Enterprise Manager, packages, 97–98
data source views (DSVs), 190–195 463–464 service, defined, 436
database relationship between SQL Server Error and Usage Slowly Changing Dimension
cubes and dimensions, 257–258 Reporting, defined, 155 Wizard, 143–144
default assemblies, 197 SQL Server Installation Center, solution and project structures,
defined, 16 defined, 156 539–540
deploying Adventure Works 2008 SQL Server Integration Services source control considerations, 112
to, 41 (SSIS) SSMS as administrative interface,
Deployment Wizard, 155 architectural components, 16
dimension design in, 140–142 435–462 upgrading packages from earlier
documenting service logon architectural overview, 436–438 versions of SQL Server, 440
account information, 94 backups and restores, 107–108, ways to check data quality,
exploring fact table data, 120 112 516–518
installing, 153 BIDS as development interface, 16 SQL Server Management Studio
installing multiple instances, 90 BIDS as tool for implementing (SSMS)
linked objects, 285 packages, 20 backups of deployed SSAS
logon permissions, 98 comparison with Data solutions, 106–107
mastering GUI, 69–70 Transformation Services, 446, connecting to SSAS in, 160
performance counters for, 87 463–464
taxonomies 763
as core tool for developing defined, 19 conceptual view, 125
OLAP cubes and data mining deploying reports, 623–624 for denormalizing, 30
structures in SSAS, 157 documenting service logon dimension tables, 117–118,
data mining object processing, account information, 94 121–125, 194
431 feature differences by edition, 606 Dimension Usage tab, in cube
defined, 19, 155, 160 installing add-in for Microsoft designer, 126–127, 134–135
menus in, 161 Office SharePoint Server, fact tables, 117–118, 118–121, 194
object viewers, 164 742–743 Microsoft changes to feature,
opening query editor window, installing and configuring, 210–211
295 606–612 moving source data to, 525–530
processing OLAP objects, 163 integration with Office SharePoint for OLAP cube modeling, 116–125
querying SSAS objects, 170–175 Server, 604 on-disk storage, 116–117
role in handling SSIS packages, mastering GUI, 69–70 physical vs. logical structures,
438–439 performance counters for, 87 116–117
verifying Adventure Works query design tools, 616–618 reasons to create, 126, 126–127
installation, 39 Report Builder, 19 tables vs. views, 116–117
viewing configuration options and scalability, 662–664 visualization, 117–118
available for SSAS, 93 scaling to multiple machines, 91 storage area networks, 91
viewing data mining structures, security decisions during Subreport data region, 638–655
164, 168–170 installation and setup, 102–104 Sum aggregate function, 147
viewing dimensions, 163 skills needed for reporting, 73, 75 Synchronize command, to back up
viewing OLAP cubes, 164–168 source control considerations, and restore, 107
viewing OLAP objects, 162–164 112–113 synchronous data flow outputs,
working with SSIS Service, SSMS as administrative interface, 458, 459
564–565 16 synchronous transformation,
SQL Server Profiler storing metadata, 604 583–586
as core tool for developing using in SharePoint integrated SynchronousInputID property,
OLAP cubes and data mining mode, 742–743 578–579
structures in SSAS, 157 using in SharePoint native mode, system variables, in SSIS, 445–446,
defined, 156 740–741 493
how query capture works, using MDX with, 349–351
172–174 Web site interface, 19
overview, 171–172 Windows Management T
role in designing aggregations, Instrumentation, 668–669 Table content type, 362
275–277 SQLPS.exe tool, 157 Table data region, 638–655
using for access auditing, 109–110 SSAS. See SQL Server Analysis table partitioning, defined, 269. See
SQL Server Reporting Services Services (SSAS) also OLTP table partitioning
(SSRS) SSAS Deployment Wizard, 155 tables
adding custom code to reports, SSIS. See SQL Server Integration parent vs. child, 5
647–649 Services (SSIS) relational, for denormalizing, 8
architecture, 603–605 SSIS Package Manager (PacMan), Tablix container, defined, 622
authentication credential flow for 600 Tablix data region, defined, 639–642
reports, 103 SSIS Package Store, 552, 554, 564 tabular report designer, 619–620
background, 19 SSIS Package Upgrade Wizard, 440 Tail MDX function, 315, 330–331
backups and restores, 108 SSIS Performance Visualization tool, Task class, 591–592
BIDS as development interface, 16 510 Task Host containers, 478
building reports for, 627–646 SSIS Service, 564–565 tasks
command-line utilities, 604 SSMS. See SQL Server Management compared with components, 444,
Configuration Manager, 102, 108, Studio (SSMS) 445
155, 607, 609, 737 SSRS. See SQL Server Reporting custom, 587–588
configuring environment for Services (SSRS) default error handling behavior,
report deployment, 623–624 SSRS Web Services API, 658–659 499
configuring with Office SharePoint stabilizing phase, MSF, 70–71 in SSIS package control flow,
Server, 737–738 staging databases, when to use, 442–444
considering where to install, 89 520–523, 524, 531 taxonomies
as core component of Microsoft star schema documenting, 67–68
BI solutions, 16 comparison with non-star designs, role in naming of OLAP objects,
creating reports, 603–624 215–217 132
764 Team Foundation Server

Team Foundation Server, 38, 540


teams. See development teams
U opening Variables window, 490
overview, 445–447
Template Explorer, 174, 422 UDM (Unified Dimensional Model), properties, 491–493
test managers 9–10, 138 RaiseChangedEvent property, 492
job duties, 81 unary operators, specifying in Scope property, 492
keeping role separate from Business Intelligence Wizard, system, 493
developer role, 81 244, 248–250 Value property, 492–493
role and responsibility on Unified Dimensional Model (UDM), ValueType property, 493
development teams, 81–82 9–10, 138 ways to use, 494–495
testing. See stabilizing phase, MSF Union MDX function, 320 variable-width columns, in data flow
testing plans, 70–71 Upgrade Package Wizard, 487 metadata, 456–457
text data type, 363 upgrading SSIS packages from Virtual PC, setting up test
Textbox data region, 638–655 earlier versions of SQL Server, configurations, 37
This function, 342 440 virtualization, 4, 91
time intelligence, configuring in URLs (uniform resource locators) Visio
Business Intelligence Wizard, enhanced arguments, 651 adding Data Mining template,
243, 245 implementing access, 651–652 47–48
Toolbox, SSIS Usage-Based Optimization Wizard, data mining integration, 714–718
adding objects to, 591 274–275 data mining integration
overview, 469 user experience managers, role and functionality, 689–690
Toolbox, SSRS, in BIDS, 621–622, 638 responsibility on development Data Mining toolbar, 715–718
Toolbox, Visual Studio, 652, 654 teams, 82, 82–83 as optional component for BI
tools user interfaces (UIs) solutions, 21
ascmd.exe tool, 157 role and responsibility of user using to create OLAP models,
BIDS Helper tool, 255, 490, 494, experience managers, 82–83 130–132, 133
510 skills needed for creating, 73, 75 Vista. See Windows Vista, IIS version
downloading from CodePlex Web utilities differences
site, 37–38, 86, 157 ascmd.exe tool, 157 Visual SourceSafe (VSS)
DTEXEC utility, 440 BIDS Helper tool, 255, 490, 494, checking files in and out, 544–545
DTEXECUI utility, 440–441 510 creating and configuring VSS
DTUTIL utility, 441 downloading from CodePlex Web database, 541–542
installed with SQL Server 2008, site, 37–38, 86, 157 creating and configuring VSS user
157 DTEXEC utility, 440 accounts, 542
rsconfig.exe tool, 604 DTEXECUI utility, 440–441 History dialog box, 545
rs.exe tool, 609 DTUTIL utility, 441 Lock-Modify-Unlock Model
SQLPS.exe tool, 157 installed with SQL Server 2008, option, 541–542
TopCount MDX function, 310 157 overview, 540
Trace Properties dialog box, 276 rsconfig.exe tool, 604 storing solution files, 542–544
Tracer utility, Microsoft Excel, 101 rs.exe tool, 609 using Add SourceSafe Database
transactional activities. See SQLPS.exe tool, 157 Wizard, 541–542
OLTP (online transactional Visual Studio. See also Solution
processing) Explorer, Visual Studio
transactions, package, 507–508 V Adventure Works.sln file, 38, 40
Transact-SQL Validate method, 592, 594 embedding custom ReportViewer
aggregate functions, 9 Value SSIS variable property, controls, 652–656
queries, 54–55 492–493 as location for SSIS package
transformation components, ValueType SSIS variable property, development, 464–472
486–488 493 relationship to BIDS, 16–17, 22,
transformations, built-in, 578 variables, in SSIS 41, 463
transforming. See ETL (extract, adding to packages, 490 relationship to SQL Server
transform, and load) systems Description property, 491 Integration Services, 440
translations differences related to SSIS resemblance to BIDS interface,
for cube metadata, 225–226 platform, 490–493 157
SSAS, defined, 149 EvaluateAsExpression property, signing custom object assemblies,
491 589
Expression property, 491–492 SSIS menu, 472
Name property, 492 SSIS package designers, 467–472
Namespace property, 492 Toolbox, 652, 654
Ytd MDS function 765
usefulness in BI development, 86 Windows Reliability and storing changes to dimensions,
viewing new SSIS project template Performance Monitor tool, 87 285–286
in, 465–466 Windows Server 2003, IIS version
VSTO (Microsoft Visual Studio differences, 86
Tools for the Microsoft Office Windows Server 2008 X
System), 683–684 IIS version differences, 86 XML data flow source, 483
VSTS (Visual Studio Team System) Performance Monitor counters, XML for Analysis. See XMLA (XML
integrating MSF Agile into, 64–65 523 for Analysis) query language
reasons to consider, 22, 546 Reliability and Performance XMLA (XML for Analysis) query
Team Foundation Server, 111–112 Monitor tool, 87 language
virtualization improvements, 4 background, 24
Windows SharePoint Services and
W Office SharePoint Server 2007,
defined, 24
query templates, 180
Warnings tab, in BIDS, 259 94 source control considerations, 113
waterfall method, 62 Windows Vista, IIS version using for data mining object
Web Parts, 21–22, 740–741 differences, 86 processing, 431
Web servers, conducting baseline Windows-on-Windows 32-bit viewing scripts, 164
survey, 85 applications, 91
Web Services API, and Excel Word 2007
Services, 732–736 as optional component for BI
solutions, 21
Y
Windows Communication Ytd MDS function, 294
Foundation (WCF), 732–733 viewing SSRS reports in, 649–650
Windows Management writeback
Instrumentation (WMI), defined, 98
668–669 overview, 145
About the Authors
Several authors contributed chapters to this book.

Lynn Langit
Lynn Langit is a developer evangelist for Microsoft. She works mostly on the West Coast of
the United States. Her home territory is southern California. Lynn spends most of her work
hours doing one of two things: speaking to developers about the latest and greatest tech-
nology, or learning about new technologies that Microsoft is releasing. She has spoken at
TechEd in the United States and Europe as well as at many other professional conferences
for developers and technical architects. Lynn hosts a weekly webcast on MSDN Channel 9
called “geekSpeak.” She is a prolific blogger and social networker. Lynn’s blog can be found at
http://blogs.msdn.com/SoCalDevGal.

Prior to joining Microsoft in 2007, Lynn was the founder and lead architect of her own
company. There she architected and developed business intelligence (BI) solutions and other .NET projects, and trained technical professionals in building them. A holder of numerous Microsoft certifications—including MCT, MCITP, MCDBA, MCSD.NET, MCSE, and MSF—Lynn also has 10 years’ experience in business management. This unique background makes her particularly qualified to share her expertise in developing successful real-world BI solutions using
Microsoft SQL Server 2008. This is Lynn’s second book on SQL Server business intelligence.

In her spare time, Lynn enjoys sharing her love of technology with others. She leads
Microsoft’s annual DigiGirlz day and camp in southern California. DigiGirlz is a free educational program targeted at providing information about careers in technology to high-school girls. Lynn also personally volunteers with a group of technical professionals who provide support to a team of local developers building and deploying an electronic medical records system (SmartCare) in Lusaka, Zambia. For more information about this project, go to
http://www.opensmartcare.org.

Davide Mauri
Davide wrote Chapters 15 and 16 in the SQL Server Integration Services (SSIS) section and
kindly reviewed Lynn’s writing for the remainder of the SSIS chapters.

Davide holds the following Microsoft certifications: MCP, MCAD, MCDBA, and Microsoft Certified Trainer (MCT), and he is a Microsoft Most Valuable Professional (MVP) for SQL Server. He has worked with SQL Server since version 6.5, and his interests cover the whole platform, from the relational engine to Analysis Services, and from architecture definition to performance tuning. Davide also has strong knowledge of XML, .NET, and object-oriented design principles,
which provides him with the vision and experience to handle the development of complex
business intelligence solutions.

Having worked as a Microsoft Certified Trainer for many years, Davide is able to share his knowledge with his co-workers, helping his team deliver high-quality solutions. He also works as a mentor for Solid Quality Mentors and speaks at many Italian-language and international BI events.

Sahil Malik
Sahil Malik wrote Chapter 25 in the SQL Server Reporting Services (SSRS) section.

Sahil, the founder and principal of Winsmarts, has been a Microsoft MVP and INETA speaker
for many years. He is the author of many books and numerous articles. Sahil is a consultant
and trainer who delivers training and talks at conferences internationally.

Kevin Goff
Kevin wrote Chapters 10 and 11 on MDX in the SQL Server Analysis Services (SSAS) section.

John Welch
John Welch was responsible for the technical review of this book.

John is Chief Architect with Mariner, a consulting firm specializing in enterprise reporting and
analytics, data warehousing, and performance management solutions. He has been working
with business intelligence and data warehousing technologies for six years, with a focus on
Microsoft products in heterogeneous environments. He is a Microsoft MVP, an award given
to him for his commitment to sharing his knowledge with the IT community. John is an experienced speaker, having given presentations at Professional Association for SQL Server (PASS)
conferences, the Microsoft Business Intelligence conference, Software Development West (SD
West), the Software Management Conference (ASM/SM), and others.

John writes a blog on business intelligence topics at http://agilebi.com/cs/blogs/bipartisan. He writes another blog focused on SSIS topics at http://agilebi.com/cs/blogs/jwelch/. He is also
active in open source projects that help make the development process easier for Microsoft
BI developers, including BIDS Helper (http://www.codeplex.com/bidshelper), an add-in for
Business Intelligence Development Studio that adds commonly needed functionality to the
environment. He is also the lead developer on ssisUnit (http://www.codeplex.com/ssisUnit), a
unit testing framework for SSIS.
