Pentaho OLAP Design Guidelines


Best Practice - Pentaho OLAP Design Guidelines
Contents
Overview
Schema Maintenance Tools
    Use Schema Workbench to create and maintain schemas
    Limit number of schemas per database connection
Schema Object Naming
    Use name and caption where available
Specific Dimension Handling
    Identify date and time dimensions and levels with Pentaho Analyzer
    Use all dates between start and end date in dimension table
    Populate dimension tables with values that exist in the fact table
    Add geographic annotations to your schema
Measure Definition
    Create measures and dimension keys
    Organize measures into sub-groups
Dimension and Hierarchy Definitions
    Create multiple hierarchies in dimensions
    Populate approxRowCount of every hierarchy
    Use shared dimensions
Development Process
    Extend Analyzer schemas incrementally
    Test your schema before publishing
Related Information
Overview
This document is intended to provide best practices around how to design and build your Pentaho
OLAP solution for maximum speed, reuse, portability, maintainability, and knowledge transfer.

Topics are arranged in a series of groups with individual best practices for that topic explained. It is
not intended to demonstrate how to implement each best practice or provide templates based on
the best practices defined within the document.

Keep these Pentaho Architecture principles in mind while you are working through this document:

1. Architecture is important, above all else.


2. Platforms are always evolving: sometimes you will have to think creatively.

Some of the things discussed here include schema maintenance, schema object naming, specific
dimension handling, and measure, dimension, and hierarchy definitions.

The intention of this document is to speak about topics generally; however, these are the specific
versions covered here:

Software    Version
Pentaho     5.4, 6.0, 6.1

Pentaho OLAP Best Practices


Pentaho 1
Schema Maintenance Tools
Use Schema Workbench to create and maintain schemas
Use Schema Workbench as the primary tool for creating and maintaining your schemas.

• Definition - Use the Pentaho Data Access (DA) Wizard only to create the initial Analyzer
schema. Use Schema Workbench or an XML editor from that point onward.
• Rationale - The DA Wizard provides only about 70% of the functionality that is available in
Schema Workbench or in the XML.

Limit number of schemas per database connection


Limit the number of schemas per database connection to one. You can have multiple cubes in one
schema as long as the cubes belong to the same connection.

• Definition - Define multiple cubes, one for each fact table, and use shared dimensions that
are conformed across the facts. Do not use the Data Access Wizard, as it only allows you
to create one cube per schema.
• Rationale - Using multiple cubes per schema gives you better caching performance, easier
maintenance, and security integration.

Schema Object Naming


Use name and caption where available
Use name and caption where available, since most schema objects have a caption property.

• Definition – Use the caption property for the text displayed to users of the data. The
designer and code refer to the name property. Use a consistent case and no spaces in the
name. When you do so, you can write your MDX queries without using [].
• Rationale – If you change the name later on, code that refers to the old name will no longer
work. Changing the caption is easier, and captions can be localized/internationalized.
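
As an illustration of the name/caption split, a dimension and level might be declared as in the following minimal sketch. The table, column, and object names are hypothetical and are not taken from any Pentaho sample schema.

```xml
<!-- Hypothetical fragment: names contain no spaces, captions carry the display text -->
<Dimension name="ProductLine" caption="Product Line">
  <Hierarchy hasAll="true" primaryKey="product_key">
    <Table name="dim_product"/>
    <Level name="ProductFamily" caption="Product Family"
           column="product_family" uniqueMembers="true"/>
  </Hierarchy>
</Dimension>
```

With names like these, an MDX reference such as ProductLine.ProductFamily.Members can be written without brackets, while users see "Product Line" and "Product Family" in the UI.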

Specific Dimension Handling


Identify date and time dimensions and levels with Pentaho Analyzer
Make sure that your date/time dimension and levels are identified to Pentaho Analyzer.

• Definition – Your date dimension should be of type TimeDimension. Each level in every time
hierarchy should have the levelType attribute set to one of the Time* types (for example,
TimeYears, TimeMonths, or TimeDays). Add AnalyzerDateFormat annotations to each level
down to the day level, as described in Enable Relative Date Filters in Mondrian. Sub-day
annotations are not supported at this time.
• Rationale – Pentaho has extra features available for time dimensions and each level type.
For instance, it knows how to calculate year to date, month to date, and so on because of
this identification. Analyzer goes further by providing a GUI for relative date filtering, but
only for those hierarchy levels that have AnalyzerDateFormat properly defined.
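
The following sketch shows how these settings might fit together in a schema. The table and column names are assumptions, and the AnalyzerDateFormat patterns shown here only work if they match how the member names are actually rendered in your data, so check them against Enable Relative Date Filters in Mondrian.

```xml
<!-- Hypothetical time dimension; dim_date and its columns are assumptions -->
<Dimension name="Time" type="TimeDimension">
  <Hierarchy hasAll="true" primaryKey="date_key">
    <Table name="dim_date"/>
    <Level name="Year" column="year_number" type="Numeric"
           uniqueMembers="true" levelType="TimeYears">
      <Annotations>
        <Annotation name="AnalyzerDateFormat">[yyyy]</Annotation>
      </Annotations>
    </Level>
    <Level name="Month" column="month_number" type="Numeric"
           uniqueMembers="false" levelType="TimeMonths">
      <Annotations>
        <Annotation name="AnalyzerDateFormat">[yyyy].[M]</Annotation>
      </Annotations>
    </Level>
    <Level name="Day" column="day_of_month" type="Numeric"
           uniqueMembers="false" levelType="TimeDays">
      <Annotations>
        <Annotation name="AnalyzerDateFormat">[yyyy].[M].[d]</Annotation>
      </Annotations>
    </Level>
  </Hierarchy>
</Dimension>
```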

Use all dates between start and end date in dimension table
Populate your dimension table with every date between your data’s start and end date.

• Definition – During data load, populate the dimension table with all dates between your
data’s start and end dates, including dates that have no fact data.
• Rationale – This allows you to analyze dates without data. This analysis is often very useful,
and it is difficult without date members for each day.

Populate dimension tables with values that exist in the fact table
For all dimensions other than time dimensions, do not populate your dimension table with values that
do not exist in the fact table.

• Definition – During data load, populate your dimensions with only those values that you
have data for.
• Rationale – Dimensions often have many more possible values than appear in the data.
Loading the unused values increases overhead and memory usage for values that cannot be
analyzed. Dates are the exception: every date exists on the calendar, even when there is no
data for it.

Add geographic annotations to your schema


Identify your geographic levels to Analyzer by adding annotations to the schema.

• Definition – Add Data.Role and Geo.Role annotations to make sure that each level is
recognized by the geo-mapping visualization in Analyzer, as described in Add Geo Map
Support to a Mondrian Schema. Alternatively, if you have run the Data Access Wizard, it
should identify and add those annotations for you.
• Rationale – This allows Analyzer to know how to drill up and down on geographic
dimension levels.
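
A geographic dimension annotated this way might look like the sketch below. The table and column names are hypothetical, and the role values shown (Geography, country, state) should be confirmed against Add Geo Map Support to a Mondrian Schema for your Pentaho version.

```xml
<!-- Hypothetical geography dimension; dim_geography and its columns are assumptions -->
<Dimension name="Geography">
  <Hierarchy hasAll="true" primaryKey="geo_key">
    <Table name="dim_geography"/>
    <Level name="Country" column="country" uniqueMembers="true">
      <Annotations>
        <Annotation name="Data.Role">Geography</Annotation>
        <Annotation name="Geo.Role">country</Annotation>
      </Annotations>
    </Level>
    <Level name="State" column="state" uniqueMembers="false">
      <Annotations>
        <Annotation name="Data.Role">Geography</Annotation>
        <Annotation name="Geo.Role">state</Annotation>
      </Annotations>
    </Level>
  </Hierarchy>
</Dimension>
```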

Measure Definition
Create measures and dimension keys
Create all possible varieties of a measure and dimension keys in a cube.

• Definition – Define the following measures on the facts, the dimension keys, and the
primary key:

o Sum, min, max, and avg measures for all facts.
o Distinct count for all dimension key fields.
o Count and distinct count measures for the primary key of the fact table.
o Optional: Create a spread (Max-Min) measure to provide a range of values to be used
for distribution analysis.
• Rationale – These additional measures might not be requested or used in Analyzer at first,
but they provide business users with maximum flexibility for data analysis without having to
modify the schema later.
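
The list above could translate into a cube fragment along these lines. This is a sketch under stated assumptions: the fact table fact_sales, its columns, and all measure names are hypothetical, and the dimension definitions are left out.

```xml
<!-- Hypothetical cube fragment; dimension definitions omitted for brevity -->
<Cube name="Sales">
  <Table name="fact_sales"/>
  <!-- Sum, min, max, and avg for one fact column -->
  <Measure name="AmountSum" caption="Amount" column="amount" aggregator="sum"/>
  <Measure name="AmountMin" caption="Minimum Amount" column="amount" aggregator="min"/>
  <Measure name="AmountMax" caption="Maximum Amount" column="amount" aggregator="max"/>
  <Measure name="AmountAvg" caption="Average Amount" column="amount" aggregator="avg"/>
  <!-- Distinct count on a dimension key -->
  <Measure name="CustomerCount" caption="Customers" column="customer_key"
           aggregator="distinct-count"/>
  <!-- Count and distinct count on the fact table primary key -->
  <Measure name="RowCount" caption="Rows" column="sales_id" aggregator="count"/>
  <Measure name="SalesCount" caption="Sales" column="sales_id"
           aggregator="distinct-count"/>
  <!-- Optional spread (max minus min) as a calculated member -->
  <CalculatedMember name="AmountSpread" caption="Amount Spread" dimension="Measures">
    <Formula>[Measures].[AmountMax] - [Measures].[AmountMin]</Formula>
  </CalculatedMember>
</Cube>
```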

Organize measures into sub-groups


Use the AnalyzerBusinessGroup annotation to organize measures into sub-groups.

• Definition – Add Business Groups in the Pentaho documentation describes this process.
• Rationale – This helps users navigate and find measures in the Analyzer UI when you have
many measures and users need natural groupings.
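
As a rough sketch, the annotation might be attached to measures as follows. The group names, measure names, and columns are hypothetical; see Add Business Groups for the authoritative description.

```xml
<!-- Hypothetical cube fragment: two measures placed in different business groups -->
<Cube name="Sales">
  <Table name="fact_sales"/>
  <Measure name="AmountSum" caption="Amount" column="amount" aggregator="sum">
    <Annotations>
      <Annotation name="AnalyzerBusinessGroup">Revenue</Annotation>
    </Annotations>
  </Measure>
  <Measure name="UnitsSum" caption="Units" column="units" aggregator="sum">
    <Annotations>
      <Annotation name="AnalyzerBusinessGroup">Volume</Annotation>
    </Annotations>
  </Measure>
</Cube>
```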

Dimension and Hierarchy Definitions
Create multiple hierarchies in dimensions
Create multiple hierarchies in dimensions where different start points and drill-downs are necessary.

• Definition – Analyzer schemas support multiple hierarchies per dimension. Avoid too many
single-level hierarchies in a dimension; convert them into member properties instead. Each
child level of a hierarchy should have more rows than its parent and be directly related to
the parent level. Independent levels should be in their own hierarchy in the same dimension.
• Rationale – Use multiple hierarchies when you need to analyze a lower level of a hierarchy
independently of the higher levels. This is most common with time/date. A day of month
level lets you analyze which day is most active without regard to the month or year levels.
In that case, the day of month level is the top-most level in a new hierarchy.
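
The day of month case might be sketched as a second hierarchy in the same time dimension. The table, column, and hierarchy names below are assumptions made for illustration.

```xml
<!-- Hypothetical time dimension with a default hierarchy and an independent one -->
<Dimension name="Time" type="TimeDimension">
  <!-- Default hierarchy: Year, then Month -->
  <Hierarchy hasAll="true" primaryKey="date_key">
    <Table name="dim_date"/>
    <Level name="Year" column="year_number" type="Numeric"
           uniqueMembers="true" levelType="TimeYears"/>
    <Level name="Month" column="month_number" type="Numeric"
           uniqueMembers="false" levelType="TimeMonths"/>
  </Hierarchy>
  <!-- Second hierarchy: day of month analyzed without regard to month or year -->
  <Hierarchy name="DayOfMonth" caption="Day of Month" hasAll="true" primaryKey="date_key">
    <Table name="dim_date"/>
    <Level name="DayOfMonth" caption="Day of Month" column="day_of_month"
           type="Numeric" uniqueMembers="true" levelType="TimeDays"/>
  </Hierarchy>
</Dimension>
```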

Populate approxRowCount of every hierarchy


Populate the approxRowCount attribute for all levels of every hierarchy.

• Definition – In Schema Workbench or XML, Pentaho allows you to specify the estimated
number of rows for each level. The value does not need to be the exact number, or even
the right order of magnitude (the correct number of zeros).
• Rationale – Analyzer uses this attribute to determine how to load and cache members.
When this is not specified, Analyzer has to build queries to go after this data at runtime,
impacting performance.
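
In XML, the attribute sits on each level, roughly as in this sketch. The dimension, table, and columns are hypothetical, and the row counts are made-up estimates.

```xml
<!-- Hypothetical customer dimension; the approxRowCount values are invented estimates -->
<Dimension name="Customer">
  <Hierarchy hasAll="true" primaryKey="customer_key">
    <Table name="dim_customer"/>
    <Level name="Country" column="country" uniqueMembers="true"
           approxRowCount="50"/>
    <Level name="City" column="city" uniqueMembers="false"
           approxRowCount="2000"/>
    <Level name="CustomerName" caption="Customer" column="customer_name"
           uniqueMembers="false" approxRowCount="150000"/>
  </Hierarchy>
</Dimension>
```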

Use shared dimensions


Make frequent but appropriate usage of shared dimensions.

• Definition – Do not define a shared dimension unless it is used by more than one cube in the
schema. If only one cube uses it, it is not shared, so make it a local dimension. If all cubes
have only conformed dimensions, consider having just one cube. Normally each cube has
at least one dimension that is not conformed. This practice implies the use of multiple
cubes within one schema. A schema should house all cubes related to that database
connection.
• Rationale – Shared dimensions allow Analyzer to cache members and reuse them across
multiple cubes in the same schema. This improves performance and reduces the memory
required.
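
A schema with one shared dimension and two cubes might be laid out as in the following sketch. All schema, cube, table, and column names are hypothetical.

```xml
<!-- Hypothetical schema: one shared dimension reused by two cubes -->
<Schema name="SalesAnalytics">
  <!-- Shared (conformed) dimension, defined once at schema level -->
  <Dimension name="Customer">
    <Hierarchy hasAll="true" primaryKey="customer_key">
      <Table name="dim_customer"/>
      <Level name="Country" column="country" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  <Cube name="Orders">
    <Table name="fact_orders"/>
    <DimensionUsage name="Customer" source="Customer" foreignKey="customer_key"/>
    <!-- Local dimension used only by this cube -->
    <Dimension name="OrderStatus" caption="Order Status" foreignKey="status_key">
      <Hierarchy hasAll="true" primaryKey="status_key">
        <Table name="dim_order_status"/>
        <Level name="Status" column="status_name" uniqueMembers="true"/>
      </Hierarchy>
    </Dimension>
    <Measure name="OrderAmount" caption="Order Amount" column="amount" aggregator="sum"/>
  </Cube>

  <Cube name="Returns">
    <Table name="fact_returns"/>
    <DimensionUsage name="Customer" source="Customer" foreignKey="customer_key"/>
    <Measure name="ReturnAmount" caption="Return Amount" column="amount" aggregator="sum"/>
  </Cube>
</Schema>
```

Each cube points at the shared definition through DimensionUsage and supplies its own foreign key, while OrderStatus stays local because only the Orders cube uses it.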

Development Process
Extend Analyzer schemas incrementally
Create and extend an Analyzer schema incrementally.

• Definition – Create only the most basic schema first (one cube, one dimension, one
hierarchy, one level, one measure) and then expand. When expanding, add only one or a
few measures or hierarchies before testing again.
• Rationale – This avoids hard-to-trace errors on the server and allows only functioning
schemas to be provided to users. When too many pieces are added to a schema, the true
error can be hard to track down.

Test your schema before publishing
Run a test MDX query on each cube in your schema before publishing to the BA Server.

• Definition – In Schema Workbench, go to File > New > MDX Query. If you can connect to the
schema, that means that Analyzer can at least parse the schema. Many errors are caught
this way. As a further test, run a simple MDX query such as "SELECT [Measures].Members ON
COLUMNS FROM [MyCubeName]". This simple test proves that you can connect to the cube
and the database, and that you can get the default members for all hierarchies.
• Rationale – This avoids hard-to-trace errors on the server and allows only functioning
schemas to be provided to end users.

Related Information
These articles provide more detail than we go into here:

• Add Business Groups
• Add Geo Map Support to a Mondrian Schema
• Enable Relative Date Filters in Mondrian

If you are using this best practices document, we would be happy if you would leave
us a comment or suggestion to let us know what you think! This will help us learn
about who is using our best practices, and also give us some insight as to what you find
helpful about them.

As always, if you have more information or a solution that would enhance this document,
we would love to hear about that, too.
