пятница, 19 сентября 2014 г.

Snowflakes and Clustering Dimensions


So far, we’ve been mostly talking about denormalizing the data in the data mart dimensions. As a result, the query join paths to the fact table are only one level deep. The only exception thus far has been the use of bridge tables for multi-valued dimension columns. You’ve seen that the general rule of thumb when modeling data marts is to denormalize the dimension tables. When you take this to the extreme, you can denormalize even further. The ultimate denormalized data model consists of a single table, at least from the user’s point of view. Dutch author and consultant Dr. Harm van der Lek described this as the One Attribute Set Interface (OASI) concept, in which all non-key attributes in a star schema are published to the end user and/or query tool as a single list. At the other end of the scale, you can find the completely normalized data models that are mostly used in transaction systems. Dimensional data marts are positioned somewhere in the middle between these two extremes.
Using normalization in a star schema is usually called snowflaking, to indicate the resemblance of this kind of schema with an actual snowflake. As with any data warehouse modeling technique, there are advocates and opponents of using snowflakes in dimensional data marts. Ralph Kimball strongly opposes using snowflakes with only one exception, explained in the following section. We’d like to list another exception, which is called clustering. This concept is described in a paper by Dr. Daniel Moody, which can be downloaded at .
The original paper is from 2000, and in the summer and fall of 2003 Moody wrote two subsequent articles for The Data Warehouse Institute (TDWI), which are still worthwhile reading. These latter articles drop the term clustering and introduce the term starflake, which boils down to the same thing. The issue at hand is caused by multiple references to the same normalized table in a data mart. In our WCM example, we have this situation with customers, warehouses, employees, and suppliers all of whom reference the same region and country table in their address fields. In a strictly enforced star schema, we need to build four denormalization transformations, one for each dimension. In this case, Moody advises to cluster the region/country table and make this a shared subdimension for all four dimension tables. The rule of thumb is that as soon as a so called fork appears in the data model, the lookup table is not denormalized but used as a cluster table. 

A fork means that two candidate dimension tables reference the same lookup table, as you can see in Left Table. The diagram shows an example of a strictly normalized solution on the left and a clustered star or starflake schema on the right.
 This approach has several advantages. First of all, it’s a minor issue, but the dimension table gets a little smaller. Of more importance is the maintenance of starflake tables: Changes occur only in one table and the ETL process only has to refresh one table instead of two or more. There are disadvantages as well, of course. You need to create extra views (when you model the solution in advance) or use extra aliases in your queries because you cannot reference the same lookup table in a query where customers and employees are involved. The biggest drawback, however, is that you are creating dependencies in your ETL process. You need to make sure that the region/country lookup table is processed before the dimensions that use this table or you run the risk of inconsistencies in your data warehouse.
If you want to adhere to a strict star schema model, which means a maximum level of 1 for the joins between facts and dimensions, there’s also another solution. Instead of snowflaking the clustered tables, you can treat them as regular dimensions. This means that the dimension keys will be part of the fact table as well. Drawbacks to this way of modeling are that you always need to traverse the fact table to get to the region/country of a customer, and of course that you need extra keys in your fact table, which can make it unnecessarily wide.