
Case Study of Building a Data Warehouse with Analysis Services (Part One)

Date: Feb 10, 2006 By Baya Pavliashvili.

In the first of a two-part series, Baya Pavliashvili, database administration expert, offers solutions to your business problem using a data warehouse.

Data warehousing has been around for decades. Yet many businesspeople and quite a few technical folks don't know what it takes to build a warehouse. Many people think that a warehouse is a data store that contains all data within the enterprise, is built within a couple of weeks, and can serve thousands of people for years to come without any additional effort or expense. Unfortunately, this view is incorrect.

In this series of articles, I give you a very simplistic example scenario and show how you can go about resolving a business problem using a data warehouse. I also describe the efforts involved in building a warehouse, for technical as well as non-technical readers.

The first article of this two-part series gives you an overview of the steps involved in building a data warehouse and introduces the example scenario. It also teaches you how to create and populate a dimensional model. The second article goes into detail about Analysis Services, MDX, and analytical views that are generated from the data warehouse.

Data Warehouse Lifecycle

A Data Warehouse (DW) lifecycle can be summarized as follows:

1. Determine the reports that the DW is supposed to support.
2. Identify data sources.
3. Extract data from their transactional sources.
4. Populate the staging area with the data extracted from transactional sources.
5. Build and populate a dimensional database.
6. Build Extraction, Transformation, and Loading (ETL) routines to populate the dimensional database regularly.
7. Build and populate Analysis Services cubes.
8. Build reports and analytical views by:
   o Using a third-party application.
   o Creating a custom analytical application and writing Multi-Dimensional eXpressions (MDX) queries against cubes.
9. Maintain the warehouse by adding/changing supported features and reports.

Reading through these points, you should be assured that:

o A DW is never "finished." It is an entity that keeps growing along with the organization's reporting needs.
o There is no such thing as a free lunch. If you want additional reports to come out of the warehouse, you need to spend additional time (and money) to extend the warehouse.
o Notwithstanding what I have said thus far, you can have a DW functioning and available for use with a limited set of features while you are still building it.

The Sample Scenario

Suppose I work for the Northwind Traders company. This hypothetical company sells products around the world and records data in the sample database that is created when you install SQL Server 2000. Let's further suppose that the business owners would like to have analytical views, including graphs and charts, that display the company's performance broken down by customer, employee, supplier, and product. Having such a tool would help stakeholders promote the products that are selling poorly in otherwise hot trading areas.

As I mentioned before, the example in this article is very simplistic and therefore considerably easier to build than a usual warehouse. Let me give you a few reasons why this is the case.

First, we have a set of reports in mind to be supported by the warehouse. That's better than what a DW architect typically gets at the beginning of the project. Usually business users know they need a DW but can't really give you a concrete idea of what the warehouse will do for them. It might take at least a few interviews to figure out exactly what type of reports and analytical views users would like to see.

Second, because the data is already in a SQL Server database that has a fairly simple structure, the first few steps of a typical warehouse project are already done for us. In reality, you don't always get this lucky: the DW architect usually has to identify multiple data sources that will be used to populate the warehouse. The organization's data could be stored in various relational database management systems (Oracle, SQL Server, DB2, and MS Access being the most common), spreadsheets, email systems, and even in paper format. Once you identify all data sources, you need to create data extraction routines to transfer data from its source to a SQL Server database. Furthermore, depending on the sources you're working with, you might not be able to manipulate the data until it is in SQL Server format.

Third, the Northwind database has intuitive object names; for example, the orders table tracks customer orders, the employees table records data about employees, and the order details table tracks the details of each order. Again, in the real world this might not be the case: you might have to figure out what cryptic object names mean and exactly which data elements you're after.
The DW architect often needs to create a list of data mappings and clean the data as it is loaded into the warehouse. For example, customer names might not always be stored in the same format in various data sources. The Oracle database might record a product name as "Sleeveless Tee for Men," whereas in Access you could have the same product referred to as "Men's T-Shirt (sleeveless)." Similarly, the field used to record product names could be called "product" in one source, "product_name" in another, and "pdct" in a third.

Once you have determined which data you need, you can create and populate a staging database and then the dimensional data model. Depending on the project, you may or may not need a staging database. If you have multiple data sources and you need to correlate data from these sources prior to populating a dimensional data structure, then having a staging database is convenient. Furthermore, a staging database is a handy tool for testing. You can compare the number of records in the original data source with the number of records in the staging tables to ensure that your ETL routines work correctly. The Northwind database already has all the data I need in an easily accessible format; therefore, I won't create a staging database.

Dimensional Modeling

Dimensional modeling is somewhat different from its relational counterpart. I won't go into the details of dimensional modeling here because these concepts have fine coverage in several books that each DW architect should read. The most commonly referenced authors in this area are Bill Inmon and Ralph Kimball.

For the purposes of this article, suffice it to say that dimensional models consist of fact and dimension tables. Typical fact tables contain numerous foreign keys referencing dimension tables. Dimension tables, on the other hand, usually contain very few columns: a dimension key, a value, create and update dates, and perhaps an obsolete date.

Fact tables record occurrences of a measurable fact, such as customer orders. Dimension tables provide a way to slice business data along various dimensions of the company's operations; for example, we can examine orders by customer or by product.

You can use the "obsolete_date" column within dimension tables to track the history of values that change over time. This concept is known as a slowly changing dimension. For example, consumers of your products might change their last names due to marriage, divorce, or another personal reason. Similarly, multiple departments within your organization can be combined into one, or one department can be divided. In some cases, you care to keep just the current value. If so, consider yourself lucky: you can simply overwrite the existing value with the new value in the dimension table. In other cases, you must keep track of the old value as well as the new value. This is when you use the record's obsolete date to track the timeframe during which the record was valid.

The Northwind Traders dimensional model will be very simple, consisting of four business dimension tables (supplier, product, customer, and employee), a time dimension table, and a fact table. Notice that because this is just a static database and I won't have any new data to populate it regularly, I won't add the obsolete_date column to the dimensions. You can create the fact and dimension tables using the following script:

CREATE TABLE dbo.dim_supplier (
    supplier_ident  INT IDENTITY(1, 1),
    supplier_id     INT NOT NULL,
    supplier_name   VARCHAR(255) NOT NULL,
    supplier_city   VARCHAR(255) NULL,
    country         VARCHAR(255) NULL
)

CREATE TABLE dbo.dim_product (
    product_ident   INT IDENTITY(1, 1),
    product_id      INT NOT NULL,
    product_name    VARCHAR(255) NOT NULL,
    discontinued    BIT NOT NULL
)

CREATE TABLE dbo.dim_customer (
    customer_ident   INT IDENTITY(1, 1),
    customer_id      VARCHAR(20) NOT NULL,
    customer_name    VARCHAR(255) NOT NULL,
    customer_city    VARCHAR(255) NULL,
    customer_country VARCHAR(255) NULL
)

CREATE TABLE dbo.dim_employee (
    employee_ident   INT IDENTITY(1, 1),
    employee_id      INT NOT NULL,
    employee_name    VARCHAR(85) NOT NULL,
    employee_city    VARCHAR(255) NULL,
    employee_country VARCHAR(255) NULL
)

CREATE TABLE dbo.dim_time (
    time_member_key           INT NOT NULL,
    calendar_date_dt          DATETIME NOT NULL,
    calendar_day_of_week_num  INT NOT NULL,
    calendar_day_of_week_name VARCHAR(15) NOT NULL,
    calendar_day_of_month_num INT NOT NULL,
    calendar_day_of_year_num  INT NOT NULL,
    calendar_week_num         INT NOT NULL,
    calendar_month_num        INT NOT NULL,
    calendar_month_name       VARCHAR(15) NOT NULL,
    calendar_quarter_num      INT NOT NULL,
    calendar_year_num         INT NOT NULL
)

CREATE TABLE fact_sales (
    customer_ident  INT NOT NULL,
    product_ident   INT NOT NULL,
    employee_ident  INT NOT NULL,
    supplier_ident  INT NOT NULL,
    total_sale      SMALLMONEY NOT NULL,
    time_member_key INT NOT NULL
)
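The script above creates the tables without declaring any keys, although fact tables typically carry foreign keys that reference the dimension tables. As a minimal sketch, the relationships could be made explicit as follows. The constraint names are hypothetical and are not part of the original script; the statements assume the tables exist exactly as created above and that fact_sales resolves to the same owner as the dimension tables.

```sql
-- Primary keys on the surrogate key of each dimension (constraint names are illustrative).
ALTER TABLE dbo.dim_supplier ADD CONSTRAINT pk_dim_supplier PRIMARY KEY (supplier_ident)
ALTER TABLE dbo.dim_product  ADD CONSTRAINT pk_dim_product  PRIMARY KEY (product_ident)
ALTER TABLE dbo.dim_customer ADD CONSTRAINT pk_dim_customer PRIMARY KEY (customer_ident)
ALTER TABLE dbo.dim_employee ADD CONSTRAINT pk_dim_employee PRIMARY KEY (employee_ident)
ALTER TABLE dbo.dim_time     ADD CONSTRAINT pk_dim_time     PRIMARY KEY (time_member_key)

-- Foreign keys from the fact table to each dimension.
ALTER TABLE fact_sales ADD CONSTRAINT fk_fact_sales_customer
    FOREIGN KEY (customer_ident) REFERENCES dbo.dim_customer (customer_ident)
ALTER TABLE fact_sales ADD CONSTRAINT fk_fact_sales_product
    FOREIGN KEY (product_ident) REFERENCES dbo.dim_product (product_ident)
ALTER TABLE fact_sales ADD CONSTRAINT fk_fact_sales_employee
    FOREIGN KEY (employee_ident) REFERENCES dbo.dim_employee (employee_ident)
ALTER TABLE fact_sales ADD CONSTRAINT fk_fact_sales_supplier
    FOREIGN KEY (supplier_ident) REFERENCES dbo.dim_supplier (supplier_ident)
ALTER TABLE fact_sales ADD CONSTRAINT fk_fact_sales_time
    FOREIGN KEY (time_member_key) REFERENCES dbo.dim_time (time_member_key)
```

Whether to enforce these constraints in the warehouse is a design choice: they catch orphaned fact rows early, but some teams leave them out to speed up bulk loads and rely on the ETL routines for integrity instead.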
Next, let's populate the dimension tables using the following queries:

-- supplier dimension:
INSERT dim_supplier (
    supplier_id,
    supplier_name,
    supplier_city,
    country)
SELECT supplierid, companyname, city, country
FROM suppliers

-- product dimension:
INSERT dim_product (
    product_id,
    product_name,
    discontinued)
SELECT productid, productname, discontinued
FROM products

-- customer dimension:
INSERT dim_customer (
    customer_id,
    customer_name,
    customer_city,
    customer_country)
SELECT customerid, companyname, city, country
FROM customers

-- employee dimension:
-- the name is built as "<title> <first name> <last name>"
INSERT dim_employee (
    employee_id,
    employee_name,
    employee_city,
    employee_country)
SELECT employeeid,
    TitleOfCourtesy + ' ' + FirstName + ' ' + LastName AS employee_name,
    city, country
FROM employees
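The record-count comparison described earlier for staging tables applies equally well to these dimension loads. The query below is a minimal sketch of such a check, assuming the Northwind source tables and the dimension tables live in the same database, as they do in this example. Each row should show the same count in both columns.

```sql
-- Compare row counts between each source table and the dimension loaded from it.
SELECT 'suppliers' AS source_table,
       (SELECT COUNT(*) FROM suppliers)    AS source_rows,
       (SELECT COUNT(*) FROM dim_supplier) AS dimension_rows
UNION ALL
SELECT 'products',
       (SELECT COUNT(*) FROM products),
       (SELECT COUNT(*) FROM dim_product)
UNION ALL
SELECT 'customers',
       (SELECT COUNT(*) FROM customers),
       (SELECT COUNT(*) FROM dim_customer)
UNION ALL
SELECT 'employees',
       (SELECT COUNT(*) FROM employees),
       (SELECT COUNT(*) FROM dim_employee)
```

A mismatch usually points to a failed or repeated load. One cause worth knowing about: if an employee row had a NULL TitleOfCourtesy, the concatenated name would be NULL and the insert into dim_employee would fail.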


Notice that dim_time is a special dimension. It isn't populated by data that is already in the warehouse. Instead, we populate it with calendar dates and date parts (day, month, quarter, year, and so forth) so that we can aggregate warehouse data as needed. You can come up with a routine that populates your own time dimension; here is a sample stored procedure that I use to populate the time dimension:

CREATE PROCEDURE load_dim_time (
    @dim_table_name VARCHAR(255),
    @start_date_dt  SMALLDATETIME,
    @end_date_dt    SMALLDATETIME
)
AS
SET NOCOUNT ON

DECLARE
    @sql_string                NVARCHAR(1024),
    @time_member_key           INT,
    @calendar_date_dt          SMALLDATETIME,
    @calendar_day_of_week_num  INT,
    @calendar_day_of_week_name VARCHAR(10),
    @calendar_day_of_month_num INT,
    @calendar_day_of_year_num  INT,
    @calendar_week_num         INT,
    @calendar_month_num        INT,
    @calendar_month_name       VARCHAR(10),
    @calendar_quarter_num      INT,
    @calendar_year_num         INT

SET @calendar_date_dt = @start_date_dt

-- walk the date range one day at a time
WHILE (@calendar_date_dt <= @end_date_dt)
BEGIN
    -- note: this check reads dim_time directly, so it only guards against
    -- duplicates when @dim_table_name is dim_time
    IF NOT EXISTS (
        SELECT time_member_key
        FROM dim_time
        WHERE calendar_date_dt = @calendar_date_dt
    )
    BEGIN
        SELECT
            @calendar_day_of_week_num  = DATEPART(DW, @calendar_date_dt),
            @calendar_day_of_week_name = DATENAME(WEEKDAY, @calendar_date_dt),
            @calendar_day_of_month_num = DATEPART(DD, @calendar_date_dt),
            @calendar_day_of_year_num  = DATEPART(DY, @calendar_date_dt),
            @calendar_week_num         = DATEPART(WK, @calendar_date_dt),
            @calendar_month_num        = DATEPART(M, @calendar_date_dt),
            @calendar_month_name       = DATENAME(MONTH, @calendar_date_dt),
            @calendar_quarter_num      = DATEPART(QQ, @calendar_date_dt),
            @calendar_year_num         = DATEPART(YYYY, @calendar_date_dt),
            -- key is the year followed by the zero-padded day of year, e.g. 1996001
            @time_member_key = CAST(
                CAST(@calendar_year_num AS VARCHAR) +
                RIGHT('00' + CAST(@calendar_day_of_year_num AS VARCHAR), 3)
                AS INT)

        -- build the INSERT dynamically because the table name is a parameter;
        -- CHAR(39) is a single quote
        SELECT @sql_string = 'INSERT INTO ' + @dim_table_name + ' ('
            + 'time_member_key, '
            + 'calendar_date_dt, '
            + 'calendar_day_of_week_num, '
            + 'calendar_day_of_week_name, '
            + 'calendar_day_of_month_num, '
            + 'calendar_day_of_year_num, '
            + 'calendar_week_num, '
            + 'calendar_month_num, '
            + 'calendar_month_name, '
            + 'calendar_quarter_num, '
            + 'calendar_year_num'
            + ') VALUES ('
            + CHAR(39) + CAST(@time_member_key AS VARCHAR) + CHAR(39) + ', '
            + CHAR(39) + CAST(@calendar_date_dt AS VARCHAR) + CHAR(39) + ', '
            + CAST(@calendar_day_of_week_num AS VARCHAR) + ', '
            + CHAR(39) + @calendar_day_of_week_name + CHAR(39) + ', '
            + CAST(@calendar_day_of_month_num AS VARCHAR) + ', '
            + CAST(@calendar_day_of_year_num AS VARCHAR) + ', '
            + CAST(@calendar_week_num AS VARCHAR) + ', '
            + CAST(@calendar_month_num AS VARCHAR) + ', '
            + CHAR(39) + @calendar_month_name + CHAR(39) + ', '
            + CAST(@calendar_quarter_num AS VARCHAR) + ', '
            + CAST(@calendar_year_num AS VARCHAR) + ')'

        EXEC sp_executesql @sql_string
    END

    SET @calendar_date_dt = @calendar_date_dt + 1
END
GO

/* now use load_dim_time procedure to populate dim_time table with needed dates */
EXEC load_dim_time 'dim_time', '1/1/96', '1/1/99'
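The scripts above populate the dimensions but do not show how fact_sales is loaded. The query below is one possible sketch, not the author's routine. It assumes the standard Northwind tables (orders, [order details], and products), treats each order line as one fact row, computes the sale amount as unit price times quantity less the discount, and assumes every order date falls within the range loaded into dim_time.

```sql
-- Hypothetical fact load: one row per order line, keyed by the surrogate
-- keys looked up from each dimension table.
INSERT fact_sales (
    customer_ident,
    product_ident,
    employee_ident,
    supplier_ident,
    total_sale,
    time_member_key)
SELECT
    c.customer_ident,
    p.product_ident,
    e.employee_ident,
    s.supplier_ident,
    -- line amount after discount; Discount is stored as a fraction (0 to 1)
    CAST(od.UnitPrice * od.Quantity * (1 - od.Discount) AS SMALLMONEY),
    t.time_member_key
FROM orders o
    INNER JOIN [order details] od ON od.OrderID = o.OrderID
    INNER JOIN products pr        ON pr.ProductID = od.ProductID
    INNER JOIN dim_customer c     ON c.customer_id = o.CustomerID
    INNER JOIN dim_product p      ON p.product_id = od.ProductID
    INNER JOIN dim_employee e     ON e.employee_id = o.EmployeeID
    INNER JOIN dim_supplier s     ON s.supplier_id = pr.SupplierID
    INNER JOIN dim_time t         ON t.calendar_date_dt = o.OrderDate
```

Because every lookup is an inner join, an order line whose customer, employee, supplier, or order date has no matching dimension row is silently dropped. Comparing the row count of fact_sales with that of [order details] after the load is a simple way to detect this.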
SQL Server has a fine ETL tool, Data Transformation Services (DTS), which you can leverage to execute and schedule DW population routines. A typical DTS package determines which data rows need to be extracted from their source and inserts those rows into the appropriate dimension and fact tables. Because this is a sample application, I won't need to create any DTS packages, but keep in mind that real-world ETL routines can get quite complicated and might take several weeks to develop.

Summary

In this first article of the series, I introduced you to the steps involved in building a typical data warehouse. You learned how to create and populate a dimensional data model, which will be a cornerstone of your warehouse and analytical reports. The next article focuses on presenting warehouse data to the users.

In Part Two of his series, database administration expert Baya Pavliashvili explores the challenges involved in building and maintaining a warehousing solution using a simple data warehouse.

The first article in this series introduced you to the steps involved in building a data warehouse. It also presented an example scenario of solving business problems with a data warehouse. This article continues exploring the challenges involved in building and maintaining a warehousing solution.

Analysis Services Cubes

So far we have dimension and fact tables in the warehouse. Now we can create dimensions and cubes within MS Analysis Services. Plenty of tutorials and articles walk you step by step through the wizards that Analysis Services supplies for creating dimensions and cubes, so I won't repeat them here. Instead, Figure 1 shows you the final view of the data within Analysis Manager.

Figure 1 View of the cube data within Analysis Manager.

NOTE
This sample application requires no special treatment within Analysis Services. You can easily follow the Dimension Wizard and Cube Wizard steps to build a reasonable cube. But in the real world you need to define calculated members, treat members on each level within a dimension hierarchy differently, and much more. I will show some tips and tricks within Analysis Services in my upcoming article.

If we had a data source that was continuously updated by Northwind employees, we would also have to reprocess the cube to make sure it had up-to-date data. You can process cubes manually through Analysis Manager or use the Analysis Services Processing Task within DTS.

Analytical Views

Once you have built cubes, you are ready to create custom reports for displaying the data to your users. You have several options for the DW presentation layer. Microsoft Excel pivot tables are a fine way to analyze data if you just need basic functionality. In fact, because most organizations use Excel as a spreadsheet management tool, you are likely to create many of your reports in Excel. Another option that comes with MS Office is using Office Web Components to create simple web pages for viewing the warehouse data.

Several third-party vendors supply advanced analytics tools for a considerable price. Perhaps the most commonly known business intelligence vendors include Cognos, ProClarity, MicroStrategy, and Business Objects. Although you could write a custom reporting tool for your own company's needs, rest assured that third-party vendors have excellent offerings that you won't match with a small effort.

Let's take a brief look at a couple of analytical views in ProClarity to see the functionality available in this tool. Figure 2 shows the top five products that Northwind sold in 1996.

Figure 2 ProClarity view of top five products sold in 1996.

You can easily change the dimensions and measures presented on the report, simply by dragging and dropping the dimensions and measures you wish to see on rows and columns; you can also choose from a variety of chart types to meet the specific needs of your users.

ProClarity also lets you create a "decomposition tree" to see the breakdown of your business operations. For example, the decomposition tree in Figure 3 shows the top five customer countries where Northwind sold products in 1998. It further tells us that we sold in 16 other countries in 1998, and that these bottom 16 countries account for approximately 36 percent of sales.

Figure 3 Breakdown of 1998 sales by customer on country level.

Next we can expand one of the top nodes, perhaps Germany, to find out which cities accounted for most sales there (see Figure 4).

Figure 4 Breakdown of sales in Germany.

The decomposition shows that Cunewalde was the top city for sales, accounting for almost half of all sales in Germany. On the other hand, the bottom six cities accounted for only 12 percent of all sales. If we drill down under Brazil's Rio de Janeiro, we see that all sales in this city came from three stores (see Figure 5).

Figure 5 Sales in Rio de Janeiro.

As you can see, ProClarity enables your business users to browse the data and create very intuitive and informative reports without any programming. Armed with this tool and the cubes you design, executives can make strategic decisions for improving the business.

MDX Queries

A discussion of data warehousing and business intelligence wouldn't be complete without a brief overview of MDX queries. MDX is a language for querying Analysis Services cubes. Unfortunately, MDX is poorly documented and has a complicated grammar. However, if you wish to build your own front-end DW tools or even customize views produced by third-party tools, you must learn MDX.

Writing very simple MDX queries is easy; in fact, such queries resemble SQL to a degree. The MDX shown in Figure 6 returns 1998 sales by country.

SELECT
    {Customer.[customer country].members} ON ROWS,
    {measures.[total sale]} ON COLUMNS
FROM sales
WHERE ([time].[1998])

Figure 6 Total sales by country in 1998.

However, don't expect to see much similarity between SQL and MDX once you get past the basic queries. For example, let's see what it would take to report an average of all sales on the [customer city] level if you also wanted to report the total sales within each U.S. city. Generating such a report in Transact-SQL is a 20-second task for any reasonable programmer; however, in MDX it is quite a challenge. A query would look like what is shown in Figure 7.

WITH
SET [blah] AS '{DESCENDANTS(customer.[usa], 1)}'
SET [rowset] AS 'HIERARCHIZE({[blah], GENERATE([blah], {ANCESTOR(customer.CurrentMember, 1)})})'
MEMBER measures.[average] AS
    'AVG(INTERSECT([blah], DESCENDANTS(customer.CurrentMember, customer.[customer city])), measures.[total sale])'
SELECT
    {measures.[average]} ON COLUMNS,
    {[rowset]} ON ROWS
FROM sales

Figure 7 Average sales in the US and total sales in each city.

I'm not an opponent of MDX; on the contrary, I believe MDX is an extremely powerful language, but due to the lack of good documentation and a rather limited supply of MDX books, it is difficult to master. Indeed, there are very few programmers who can write advanced MDX. The good news is that many front-end tools (including Excel and ProClarity) will write MDX for you. The not-so-good news is that MDX written by such tools isn't always what you need on your reports. In a nutshell, if your business users would like functionality that isn't supported by tools such as Cognos or ProClarity, they should expect to spend top dollar on an MDX resource (and good luck finding a good one).

DW Maintenance

Much like any database system, a data warehouse requires considerable maintenance that shouldn't be overlooked. As the warehouse grows, the database administrator must determine the optimal physical layout for data files, file groups, tables, and indexes. This is an example of operational maintenance.

Data sources populating the warehouse tend to change; therefore, you will likely have to change the DTS packages as well. For example, if Northwind acquires another company (perhaps Southwind), your users will probably want to examine Southwind's orders using the existing tools. Similarly, if your boss decides that he would like a report showing Northwind orders by shipper to see which shipments get to their destination on time, you'll have to add new dimensions and measures to your warehouse.

Summary

This series of articles showed a very simple example of building a data warehousing solution. Even with this sample, numerous steps must be taken to build an accurate, functioning warehouse. What's more important, the warehouse is never "finished": it continues to evolve and grow along with the company's reporting requirements. Therefore, before undertaking a DW initiative, ensure that you have appropriate support from the business owners. Although your boss might request a data warehouse overnight, he or she isn't going to reap many benefits from such a system. Be sure to mentor and educate your stakeholders so that their expectations are appropriate from the get-go.
