So far, we’ve been mostly talking about denormalizing the data in the data
mart dimensions. As a result, the query join paths to the fact table are only one
level deep. The only exception thus far has been the use of bridge tables for
multi-valued dimension columns. You’ve seen that the general rule of thumb
when modeling data marts is to denormalize the dimension tables. When
you take this to the extreme, you can denormalize even further. The ultimate
denormalized data model consists of a single table, at least from the user’s
point of view. Dutch author and consultant Dr. Harm van der Lek described
this as the One Attribute Set Interface (OASI) concept, in which all non-key
attributes in a star schema are published to the end user and/or query tool as a
single list. At the other end of the scale, you can find the completely normalized
data models that are mostly used in transaction systems. Dimensional data
marts are positioned somewhere in the middle between these two extremes.
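The OASI idea can be sketched with a small script. This is a hypothetical illustration in SQLite; the table, column, and view names are invented for the example rather than taken from the WCM schema. The point is only that a view can publish every non-key attribute of a star schema to the user as a single flat table:

```python
import sqlite3

# A minimal sketch of the OASI concept, using SQLite for illustration.
# All names here are hypothetical: a view joins the fact table to its
# dimensions so the user sees one single attribute set.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer,
    date_key     INTEGER REFERENCES dim_date,
    sales_amount REAL
);
INSERT INTO dim_customer VALUES (1, 'Smith');
INSERT INTO dim_date     VALUES (20090101, '2009-01-01');
INSERT INTO fact_sales   VALUES (1, 20090101, 9.95);

-- The OASI view: one attribute set, no visible join paths for the user.
CREATE VIEW v_sales_oasi AS
SELECT d.calendar_date, c.customer_name, f.sales_amount
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_date     d ON d.date_key     = f.date_key;
""")
rows = con.execute("SELECT * FROM v_sales_oasi").fetchall()
print(rows)  # one flat row per fact
```

A query tool pointed at `v_sales_oasi` sees a single list of attributes, even though the underlying storage is still a star schema.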
Using normalization in a star schema is usually called snowflaking, to indicate the resemblance of this kind of schema to an actual snowflake. As with any data warehouse modeling technique, there are advocates and opponents of using snowflakes in dimensional data marts. Ralph Kimball strongly opposes using snowflakes, with only one exception, explained in the following section. We’d like to list another exception, which is called clustering. This concept is described in a paper by Dr. Daniel Moody, which can be downloaded at .
The original paper is from 2000, and in the summer and fall of 2003 Moody wrote two subsequent articles for The Data Warehouse Institute (TDWI), which are still worth reading. These later articles drop the term clustering and introduce the term starflake, which boils down to the same thing. The issue at hand is caused by multiple references to the same normalized table in a data mart. In our WCM example, we have this situation with customers, warehouses, employees, and suppliers, all of which reference the same region and country table in their address fields. In a strictly enforced star schema, we need to build four denormalization transformations, one for each dimension. In this case, Moody advises clustering the region/country table and making it a shared subdimension for all four dimension tables. The rule of thumb is that as soon as a so-called fork appears in the data model, the lookup table is not denormalized but used as a cluster table.
A fork means that two candidate dimension tables reference the same lookup table, as you can see in the diagram, which shows an example of a strictly normalized solution on the left and a clustered star or starflake schema on the right.
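The clustered (starflake) side of the diagram can be sketched in DDL. This is a hypothetical illustration in SQLite with invented names, not the actual WCM definitions: the region table, with country denormalized into it, is kept as one shared subdimension that both the customer and warehouse dimensions reference:

```python
import sqlite3

# Sketch of Moody's clustering: the region lookup stays a single shared
# subdimension instead of being denormalized into each of the four
# dimensions that fork into it. All names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_region (            -- the shared cluster table
    region_key   INTEGER PRIMARY KEY,
    region_name  TEXT,
    country_name TEXT                -- country denormalized into region
);
-- Two of the four dimensions that reference the same lookup:
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    region_key    INTEGER REFERENCES dim_region
);
CREATE TABLE dim_warehouse (
    warehouse_key  INTEGER PRIMARY KEY,
    warehouse_name TEXT,
    region_key     INTEGER REFERENCES dim_region
);
""")
# A change to a region now happens in exactly one place, and the ETL
# process only has to refresh dim_region.
con.execute("INSERT INTO dim_region VALUES (1, 'Utrecht', 'Netherlands')")
con.execute("INSERT INTO dim_customer VALUES (10, 'Smith', 1)")
row = con.execute("""
    SELECT c.customer_name, r.region_name, r.country_name
    FROM dim_customer c JOIN dim_region r ON r.region_key = c.region_key
""").fetchone()
print(row)
```

The same `dim_region` row would serve the employee and supplier dimensions as well, which is exactly what makes the maintenance argument below work.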
This approach has several advantages. First of all, a minor one: the dimension tables get a little smaller. Of more importance is the maintenance of starflake tables: changes occur in only one table, and the ETL process only has to refresh one table instead of two or more. There are disadvantages as well, of course. You need to create extra views (when you model the solution in advance) or use extra aliases in your queries, because a query that involves both customers and employees must reference the same lookup table twice. The biggest drawback, however, is that you are creating dependencies in your ETL process. You need to make sure that the region/country lookup table is processed before the dimensions that use it, or you run the risk of inconsistencies in your data warehouse.
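The aliasing point deserves a concrete example. In this hypothetical SQLite sketch (invented names again), the shared region table has to appear twice in one query, once in the customer role and once in the employee role:

```python
import sqlite3

# Why extra aliases are needed with a shared cluster table: the same
# lookup table plays two roles in one query. Names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_region   (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY,
                           customer_name TEXT, region_key INTEGER);
CREATE TABLE dim_employee (employee_key INTEGER PRIMARY KEY,
                           employee_name TEXT, region_key INTEGER);
CREATE TABLE fact_sales   (customer_key INTEGER, employee_key INTEGER,
                           sales_amount REAL);
INSERT INTO dim_region   VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_customer VALUES (10, 'Smith', 1);
INSERT INTO dim_employee VALUES (20, 'Jones', 2);
INSERT INTO fact_sales   VALUES (10, 20, 9.95);
""")
# dim_region appears twice, under the aliases cr and er.
rows = con.execute("""
    SELECT cr.region_name AS customer_region,
           er.region_name AS employee_region,
           f.sales_amount
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_employee e ON e.employee_key = f.employee_key
    JOIN dim_region  cr ON cr.region_key  = c.region_key
    JOIN dim_region  er ON er.region_key  = e.region_key
""").fetchall()
print(rows)
```

Defining two views over `dim_region` (say, a customer-region view and an employee-region view) in advance spares end users from having to write these aliases themselves.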
If you want to adhere to a strict star schema model, which means a maximum join depth of one between facts and dimensions, there is another solution. Instead of snowflaking the clustered tables, you can treat them as regular dimensions. This means that their dimension keys become part of the fact table as well. Drawbacks to this way of modeling are that you always need to traverse the fact table to get to the region/country of a customer, and of course that you need extra keys in your fact table, which can make it unnecessarily wide.
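Both drawbacks are visible in a sketch of this alternative. Again a hypothetical SQLite example with invented names: the region key sits directly in the fact table, so every dimension joins to the facts at depth one, but reaching a customer's region now always goes through the fact table:

```python
import sqlite3

# Sketch of the strict-star alternative: the shared region table becomes
# a regular dimension, so its key is added to the fact table (once per
# role it plays). All names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_region   (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE fact_sales (
    customer_key        INTEGER REFERENCES dim_customer,
    customer_region_key INTEGER REFERENCES dim_region,  -- extra key: wider fact row
    sales_amount        REAL
);
INSERT INTO dim_region   VALUES (1, 'North');
INSERT INTO dim_customer VALUES (10, 'Smith');
INSERT INTO fact_sales   VALUES (10, 1, 9.95);
""")
# Join depth is one for every dimension, but the customer's region can
# only be reached by traversing the fact table.
row = con.execute("""
    SELECT c.customer_name, r.region_name
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_region   r ON r.region_key   = f.customer_region_key
""").fetchone()
print(row)
```

With four role-playing dimensions (customer, warehouse, employee, supplier), the fact table would carry four such region keys, which is exactly the width penalty described above.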