Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Declaring the Grain

kimballgroup.com/2003/03/declaring-the-grain/

By Ralph Kimball March 1, 2003

In debugging literally thousands of dimensional designs from my students over the years, I
have found that the most frequent design error by far is not declaring the grain of the fact
table at the beginning of the design process. If the grain isn’t clearly defined, the whole
design rests on quicksand. Discussions about candidate dimensions go around in circles,
and rogue facts that introduce application errors sneak into the design.

Declaring the grain means saying exactly what a fact table record represents. Remember
that a fact table record captures a measurement. Example declarations of the grain include:

An individual line item on a customer’s retail sales ticket as measured by a scanner


device
An individual transaction against an insurance policy
A line item on a bill received from a doctor
A boarding pass used by someone on an airplane flight
An inventory measurement taken every week for every product in every store.

In Business Terms

Notice that most of these grain declarations are expressed in business terms. Perhaps you
were expecting the grain to be a traditional declaration of the fact table’s primary key.
Although the grain ultimately is equivalent to the primary key, it’s a mistake to list a set of
dimensions and then assume that this list is the grain declaration. This is the most common
mistake in my students’ designs. In a properly executed dimensional design, the grain is
first anchored to a clear business object (no pun intended) and a set of business rules.
Then, the dimensions that implement that grain become obvious.

So, when you make a grain declaration, you can then have a very precise discussion of
which dimensions are possible and which are not. For example, a line item of a doctor’s bill
likely would have the following dimensions:

Date (of treatment)


Doctor (may be called “provider”)
Patient
Procedure
Primary Diagnosis
Location (presumably the doctor’s office)
Billing Organization (an organization the doctor belongs to)
Responsible Party (either the patient or the patient’s legal guardian)
Primary Payer (often an insurance plan)
1/4
Secondary Payer (maybe the responsible party’s spouse’s insurance plan).

And quite possibly others.

Powerful Effects
If you’ve been following this example, I hope you’ve noticed some powerful effects from
declaring the grain. First, you can visualize the dimensionality of the doctor bill line item
very precisely, and you can therefore confidently examine your data sources, deciding
whether or not a dimension can be attached to this data. For example, you probably would
exclude “treatment outcome” from this example because most medical billing data doesn’t
tie to any notion of outcome.

But a general entity/relation-oriented “data model” of doctor visits might well include
treatment outcome. After all, in an abstract sense, doesn’t every treatment have an
outcome?

The discipline of insisting on the grain declaration at the beginning of a dimensional design
keeps you from making this kind of mistake. A model of billable doctor visits that included
treatment outcome would look like a dimensional model but it wouldn’t be implementable.
This is my main gripe with many of the current offerings of “standard schemas” in books
and CDs. Because they have no grain discipline, they often combine entities that don’t
coexist in real data sources. Every fact table design must be rooted in the realities of
available physical data sources.

A second major insight from the doctor bill line item grain declaration is that this very atomic
grain gives rise to many dimensions! I listed 10 dimensions, and experts in health care
billing probably know of a few more. It’s an interesting realization that the smaller and more
atomic the measurement (fact table record), the more things you know for sure, and the
more dimensions you have. This phenomenon is another way of explaining why atomic
data resists the ad hoc attack by end users. Atomic data has the most dimensionality and
so it can be constrained and rolled up in every way possible for that data source. Atomic
data is a perfect match for the dimensional approach.

In the doctor bill line item example with 10 dimensions, you wouldn’t expect the primary key
of the fact table to consist of all 10 dimensional foreign keys. Logically from a business
rules perspective, perhaps we know that the combination of date, doctor, patient, and
procedure are enough to guarantee a unique record. So these fields could implement the
fact table primary key. Or, it is possible that we have extra “degenerate” keys for the patient
bill number and the line item number that would, by themselves, implement an acceptable
physical fact table key. But we are confident that we can add these degenerate dimensions
to the design because they are consistent with our grain declaration. The grain declaration
is like a contract!

The grain declaration lets us think creatively about adding dimensions to a fact table design
that may not obviously be present in the source data. In retail sales data, marketplace
causal factors like promotions and competitive effects may be very important to
understanding the data, but this information may not be present in a literal sense in the data
extract. The grain definition (see the first example in our list of grains) tells us that we can
2/4
indeed add a causal “store condition” dimension to the fact table as long the store condition
descriptions vary appropriately by time, product, and store location. A weather dimension
can be added to many fact table designs using the same logic. Once such a new
dimension is identified, it is incumbent on the data warehouse designer to find the
appropriate store condition or weather data source and insert it into the backroom data
staging applications that build the fact tables.

All of the grain declarations listed in this article represent the lowest possible granularity of
their respective data sources. These data measurements are “atomic” and cannot be
divided further. But it is quite possible to declare higher level grains for each of these data
sources that represent aggregations of atomic data:

All the sales for a product in a store on a day


Insurance policy transaction totals by month by line of business
Charged amount totals by treatment by diagnosis by month
Counts of passengers and other flight customer-satisfaction issues by route by month
Average inventory levels by quarter by region.

These higher levels of aggregation will usually have fewer, smaller dimensions. The doctor
example might end up with only the dimensions:

Month
Doctor
Procedure
Diagnosis.

It would be nonsensical in an aggregated fact table to try to include all the original
dimensions of the atomic data, because you would find yourself recapitulating the atomic
level of the data!

Useful aggregations necessarily shrink dimensions and remove dimensions; therefore,


aggregated data always needs to be used in conjunction with its base atomic data because
aggregated data has less dimensional detail. Some developers get confused on this point,
and after declaring that data marts necessarily consist of aggregated data, they criticize the
data marts for “anticipating the business question.” All these misunderstandings disappear
when aggregated data is made available along with the atomic data from which it’s derived.

Keep Facts True to the Grain

The most important result of declaring the grain of the fact table is anchoring the discussion
of the dimensions. But declaring the grain lets you be equally clear about the measured
numeric facts. Simply put, the facts must be true to the grain. In the doctor example, the
most obvious measured fact would be “billed amount,” relating to the specific line item.

Other facts relating to the treatment that patient received at that time may also exist. But
helpful facts, such as the amount billed year-to-date to this patient for all treatments, aren’t
true to the grain. When a reporting application combines fact records arbitrarily, these

3/4
untrue-to-the-grain facts produce nonsensical, useless results. Viewed in this way, these
facts are dangerous because they invite the end user to make mistakes. Omit them from
the design. Calculate such aggregate measures in your application.

Brief Recap

In summary, try to do your dimensional designs using the following four steps, in order:

1. Decide on your sources of data.

2. Declare the grain of the fact table (preferably at the most atomic level).

3. Add dimensions for “everything you know” about this grain.

4. Add numeric measured facts true to the grain.

4/4

You might also like