пятница, 19 сентября 2014 г.

Granularity and Aggregation


By granularity, we mean the level of detail at which the data is stored in the data warehouse. The golden rule here is to store the data at the lowest level of detail possible. For a retail company, this means the individual sales transaction level; for a mobile operator, it is the call detail record level. In the early days of data warehousing, disk space was expensive and computer power limited, but with today’s state of technology storing and querying, terabytes of data are within reach of most organizations.
One of the misconceptions about star schemas is that the fact tables must always be pre-aggregated. Aggregation can be a good thing for increasing query performance, but only after loading the data at the lowest level of detail. It’s very easy to see whether this lowest level is actually loaded: If you need to do a sum() function when loading the fact records, you’re not loading the lowest level of detail. The reasoning behind designing the data warehouse to capture the lowest level of detail is very simple. You can always aggregate data when the details are available, but when the details aren’t available it’s impossible to add them without having to rebuild the entire data warehouse.
Aggregation can have a dramatic impact on performance. A table that aggregates data by month, region, and product category to display in a management report contains probably more than 1,000 times less data than the lowest level transaction fact table. We’ve witnessed query speed increases from an average of 30 minutes to a couple of milliseconds by using aggregate techniques. Although these results are spectacular, you shouldn’t forget that the data model exposed to the outside world (the end users) is the detailed granular one. The existence of aggregate tables should be invisible to end users, while the redirection of queries from the detail tables to the aggregates should be handled by an intelligent query governor.
This query-governing mechanism is not yet available in open source databases, but fortunately you can use the aggregate designer of Mondrian to at least partly benefit from the automated creation and usage of aggregate tables in your data warehouse solution. Using the aggregate designer along with a columnar database such as LucidDB or Infobright, you can achieve very good query results combined with the availability of detailed fact data.