
Star schema

Last week somebody asked me that classic question again: which one should I use, star schema or snowflake, and why? Star schema, because:

1. Query performance
Because it is a one-level join, star schema performs better at query time than snowflake. We don't need to join multiple tables to form a dimension. All we need to join is the dimension table and the fact table. If your BI tool is all in memory, such as QlikView, PowerPivot and SSAS (MOLAP), then they are fast. But if your BI tool touches the disk every time, such as BO InfoView and SSRS, then you need ...

2. Simplicity in design
This is the main reason why I prefer star schema to snowflake. It is simple. We do not normalise the dimension. We put all the product attributes in one dimension, so it is easy to relate one attribute to another without joining tables. The most important reason to choose Star Schema is this: simplicity in SCD type 2. With Star Schema, the product dimension is one table, so it is easier to make it type 2. With Snowflake it is scattered over 5-6 tables and it is a nightmare to implement type 2.

3. Flexibility for hierarchy
The relationship between hierarchy levels can be different for each dimension. The relationship of city-area-region in the customer dimension can be different from the relationship of city-area-region in the branch dimension, and it can be different again in the supplier dimension. We don't put them in one central location as in Snowflake. This makes it flexible to define the hierarchy.

4. Compatibility with BI tools and database engines
Star schema has wider recognition in the DWBI industry. Some database engines, such as SQL Server 2008, recognise that it is a star schema and optimise the query accordingly. Some BI tools, such as BO and MicroStrategy, can create the metadata layer automatically based on a star schema.

As a principle, we always have to look at the other side. We must not look at one side only. The disadvantages of Star Schema are:

1. Longer load time
Because we need to perform joins to get all the data required, the ETL in Star Schema takes longer than in Snowflake. But please see SCD type 2 in point 2 above: it takes longer to load a Snowflake Schema if the dimension is type 2.

2. Inflexibility
In Star Schema the dimension is already formed in its entirety, so it's not possible to dismantle it and offer half of it to one fact table and the other half to another fact table. In Snowflake we can. The dimension is stored in several pieces, enabling us to take the pieces we need.

Now, after all of the above, you need to read this: When to snowflake (link).
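
To make the one-level join in point 1 concrete, here is a minimal sketch using Python's built-in sqlite3 module. All table names, column names and data (DimProduct, SfProduct, SfGroup, SfCategory, FactSales) are made up for illustration and are not from any real warehouse. It only shows the shape of the two queries; it says nothing about actual timings on a particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalised product dimension (hypothetical names)
CREATE TABLE DimProduct (
    ProductKey INTEGER PRIMARY KEY,
    ProductName TEXT, ProductGroup TEXT, ProductCategory TEXT);

-- Snowflake: the same attributes normalised into three tables
CREATE TABLE SfCategory (CategoryKey INTEGER PRIMARY KEY, ProductCategory TEXT);
CREATE TABLE SfGroup (GroupKey INTEGER PRIMARY KEY, ProductGroup TEXT,
                      CategoryKey INTEGER);
CREATE TABLE SfProduct (ProductKey INTEGER PRIMARY KEY, ProductName TEXT,
                        GroupKey INTEGER);

CREATE TABLE FactSales (ProductKey INTEGER, Amount REAL);

INSERT INTO DimProduct VALUES
    (1, 'Road bike', 'Bikes', 'Sport'),
    (2, 'Helmet', 'Accessories', 'Sport'),
    (3, 'Kettle', 'Kitchen', 'Home');
INSERT INTO SfCategory VALUES (10, 'Sport'), (20, 'Home');
INSERT INTO SfGroup VALUES (100, 'Bikes', 10), (200, 'Accessories', 10),
                           (300, 'Kitchen', 20);
INSERT INTO SfProduct VALUES (1, 'Road bike', 100), (2, 'Helmet', 200),
                             (3, 'Kettle', 300);
INSERT INTO FactSales VALUES (1, 500.0), (2, 40.0), (3, 25.0), (1, 450.0);
""")

# Star: the fact table joins to a single dimension table.
star_sql = """
    SELECT d.ProductCategory, SUM(f.Amount)
    FROM FactSales f
    JOIN DimProduct d ON d.ProductKey = f.ProductKey
    GROUP BY d.ProductCategory
    ORDER BY d.ProductCategory
"""

# Snowflake: three joins are needed to reach the category attribute.
snowflake_sql = """
    SELECT c.ProductCategory, SUM(f.Amount)
    FROM FactSales f
    JOIN SfProduct p ON p.ProductKey = f.ProductKey
    JOIN SfGroup g ON g.GroupKey = p.GroupKey
    JOIN SfCategory c ON c.CategoryKey = g.CategoryKey
    GROUP BY c.ProductCategory
    ORDER BY c.ProductCategory
"""

star_rows = conn.execute(star_sql).fetchall()
snowflake_rows = conn.execute(snowflake_sql).fetchall()
print(star_rows)       # [('Home', 25.0), ('Sport', 990.0)]
print(snowflake_rows)  # same result, reached through two extra joins
```

Both queries return the same totals; the difference is only in how many tables have to be joined to form the product dimension.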

Snowflake schema
One of the most frequently asked questions in dimensional modelling is when to snowflake. Everybody talks about when not to snowflake, but when to snowflake is very rarely discussed. It is a general consensus in the data warehousing world that we must always use star schema for the presentation layer, but as with everything else in the world, there are always two sides to it. Sometimes more than two.

In this article I will not go through the star vs snowflake arguments at all. There are many web pages and books explaining this. I'm going to jump straight to when we should snowflake. I found that there are a few situations where we should consider snowflake:

1. When the sub dimension is used by several dimensions
2. When the sub dimension is used by both the main dimension and the fact table(s)
3. To make base dimension and detail dimensions
4. To enrich a date attribute

I'll go through them one by one. As usual it is easier to explain and learn by example.

When the sub dimension is used by several dimensions
Example: in an insurance data warehouse, the City-Country-Region columns which exist in DimBroker, DimPolicy, DimOffice and DimInsured could be replaced by a LocationKey pointing to DimLocation. Some call it DimGeography. This gives us a consistent hierarchy, i.e. a consistent relationship between City, Country and Region.

Whilst the advantage of this approach is consistency, its weakness is that we lose flexibility. For example, in DimOffice we could have 3 hubs (EMEA, America and Asia Pacific), not the 7 regions used by DimBroker, DimPolicy and DimInsured (North America, South America, Africa, Europe, Middle East, Asia, Australia). It is a common understanding that the relationship between City and Country is more or less fixed, but the grouping of countries might be different between dimensions. If we put City, Country and Region in each of the 4 dimensions, each dimension has the flexibility of having a different hierarchy.

In the DimOffice case above we have 2 options: a) put City, Country and Region in DimOffice, or b) have 2 different attributes in DimLocation: Region and Hub. Some people go for c) create DimOfficeLocation, but I don't see the point if this dim is only used in DimOffice; we might as well unite it with DimOffice as per approach a).

Other examples of this case are: a) DimManufacturer and DimPackaging, used by several product dimensions in a manufacturing data mart, b) DimEmployee in a Project data mart, c) DimCustomer in a CRM data warehouse, and d) DimBranch (some call it DimOffice) in a Retail Banking warehouse.

I've heard a few discussions about DimAddress, which like DimLocation is used by DimOffice, DimCustomer, DimSupplier etc. (Retail mart), but usually the conclusion was that it's better to put the address attributes (street, post code, etc.) directly in the main dimension.

Sometimes the designer uses a hybrid approach, i.e. they put the attributes both in the main dim and in the sub dim. Again this is for flexibility reasons, particularly if the sub dim is used directly by several fact tables (see below).

When the sub dimension is used by both the main dimension and the fact table(s)
A classic example of this situation is DimCustomer, when an account can only belong to 1 customer. DimCustomer is used in DimAccount, and is also used by the fact tables. Other examples are: DimManufacturer, DimBranch, DimProductGroup. In the case of DimProductGroup, some fact tables are at product level, but some fact tables are at Product Group level.
Hence we need both DimProduct and DimProductGroup. The options here are: a) put the product group attributes in both DimProduct and DimProductGroup, or b) snowflake, i.e. DimProduct doesn't have Product Group attributes; it only has ProductGroupKey.

To make base dimension and detail dimensions
The classic examples of this case are: insurance classes or LOB (line of business), Retail Banking account types, attributes for different Product Lines, and in Pharma we have Disease/Medicine categories. Insurance policies from different insurance classes or LOBs (e.g. marine, aviation, motor, property) have different attributes. So we pull the common attributes into 1 dimension called DimBasePolicy and the class-specific attributes into DimMarinePolicy, DimMotorPolicy, etc.
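
The base-detail split described above can be sketched like this, again with Python's sqlite3. The table names DimBasePolicy, DimMarinePolicy and DimMotorPolicy come from the text; every column name and data row is an assumption made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Attributes shared by every insurance class (column names are illustrative)
CREATE TABLE DimBasePolicy (
    PolicyKey      INTEGER PRIMARY KEY,
    PolicyNumber   TEXT,
    InsuranceClass TEXT
);

-- Class-specific attributes, one row per policy of that class
CREATE TABLE DimMarinePolicy (
    PolicyKey  INTEGER PRIMARY KEY REFERENCES DimBasePolicy (PolicyKey),
    VesselName TEXT,
    Tonnage    INTEGER
);
CREATE TABLE DimMotorPolicy (
    PolicyKey   INTEGER PRIMARY KEY REFERENCES DimBasePolicy (PolicyKey),
    VehicleMake TEXT,
    EngineCc    INTEGER
);

INSERT INTO DimBasePolicy VALUES
    (1, 'P-001', 'Marine'), (2, 'P-002', 'Motor'), (3, 'P-003', 'Marine');
INSERT INTO DimMarinePolicy VALUES (1, 'Sea Breeze', 1200), (3, 'North Star', 800);
INSERT INTO DimMotorPolicy VALUES (2, 'Volvo', 1600);
""")

# Marine analysis joins the base dimension to the marine detail dimension only.
marine_rows = conn.execute("""
    SELECT b.PolicyNumber, m.VesselName, m.Tonnage
    FROM DimBasePolicy b
    JOIN DimMarinePolicy m ON m.PolicyKey = b.PolicyKey
    ORDER BY b.PolicyNumber
""").fetchall()

# Queries that need only the common attributes never touch the detail tables.
class_counts = conn.execute("""
    SELECT InsuranceClass, COUNT(*)
    FROM DimBasePolicy
    GROUP BY InsuranceClass
    ORDER BY InsuranceClass
""").fetchall()

print(marine_rows)   # [('P-001', 'Sea Breeze', 1200), ('P-003', 'North Star', 800)]
print(class_counts)  # [('Marine', 2), ('Motor', 1)]
```

In this sketch the detail tables share the base dimension's PolicyKey, which is one possible way to link them; a separate key per detail dimension would work as well.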

LOB is a US market term, whereas Class is a Lloyd's term (London market). Similarly, in retail banking we have DimBaseAccount, DimSavingAccount, DimMortgageAccount, DimCurrentAccount, etc. In investment banking different asset classes have different attributes.

The alternatives to this design are: a) have one generic detail dimension, with 100 attributes from different categories, or b) a normalised version with 4 columns or so. Approach a) would be very wide and sparse because (for example) Marine rows only use Marine attributes, etc. But it is easier to use than the base-detail approach. Approach b) would be in-query-able (is that a word?) and difficult to use/join.

To enrich a date attribute
A date attribute (e.g. 2011-03-11) is often analysed by Month, Quarter or Year. So if we have MaturityDate, EffectiveDate and CancellationDate in the dimension and the business needs to analyse each of them by Week, Month, Quarter and Year, then we would need 12 attributes in the dim for this (3 dates x 4 levels). By replacing the date attribute with a date key, we can analyse the date by any attribute in the date dim.

In this case I'd recommend using a smart integer date key rather than a surrogate date key, e.g. 20110311 rather than 12345. It is more user-friendly, and it still has meaning if you don't link to DimDate. The smart date key is a generally accepted exception to the Kimball rule of surrogate keys.
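
As a small sketch of the smart date key idea, the Python below converts between a date and its yyyymmdd integer key and joins two date keys in a dimension to a date dim. The helper names (smart_date_key, date_from_key) and the DimDate and DimPolicy columns are assumptions for illustration; only the 20110311 format comes from the text.

```python
import sqlite3
from datetime import date, timedelta


def smart_date_key(d: date) -> int:
    """Return the yyyymmdd integer key for a date, e.g. 2011-03-11 -> 20110311."""
    return d.year * 10000 + d.month * 100 + d.day


def date_from_key(key: int) -> date:
    """Recover the date from a smart key without looking it up in DimDate."""
    return date(key // 10000, key // 100 % 100, key % 100)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY, "
    "Year INTEGER, Quarter INTEGER, Month INTEGER)"
)

# Populate a hypothetical date dim with one row per day of 2011.
day = date(2011, 1, 1)
while day <= date(2011, 12, 31):
    conn.execute(
        "INSERT INTO DimDate VALUES (?, ?, ?, ?)",
        (smart_date_key(day), day.year, (day.month - 1) // 3 + 1, day.month),
    )
    day += timedelta(days=1)

# The dimension stores date keys instead of Week/Month/Quarter/Year columns.
conn.execute(
    "CREATE TABLE DimPolicy (PolicyKey INTEGER PRIMARY KEY, "
    "EffectiveDateKey INTEGER, MaturityDateKey INTEGER)"
)
conn.execute(
    "INSERT INTO DimPolicy VALUES (1, ?, ?)",
    (smart_date_key(date(2011, 3, 11)), smart_date_key(date(2011, 11, 30))),
)

# Each date key joins to DimDate under its own alias.
row = conn.execute("""
    SELECT e.Quarter, m.Quarter
    FROM DimPolicy p
    JOIN DimDate e ON e.DateKey = p.EffectiveDateKey
    JOIN DimDate m ON m.DateKey = p.MaturityDateKey
""").fetchone()

print(smart_date_key(date(2011, 3, 11)))  # 20110311
print(date_from_key(20110311))            # 2011-03-11
print(row)                                # (1, 4)
```

Because the key is readable on its own, date_from_key works even when DimDate is not joined, which is the user-friendliness argument made above.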
