пятница, 19 сентября 2014 г.

Monster Dimensions


Assume, you have the 150 million record customer table before. Maybe your customer dimension isn’t that big, but even with 5 million records, maintaining the dimension table in a limited batch window can be challenging enough. We’ve shown one of the solutions to make a monster dimension better manageable and optimize query performance: the mini-dimensions. Still, when your monster dimension contains a lot of detail columns that will not be used for analytical purposes anyway, it might be a good idea to split them into a separate table, also called vertical partitioning. The query speed in a row-based database like MySQL is not only determined by the number of rows, but also by the total byte size of each row. When a column-based database such as Infobright, LucidDB or MonetDB is used, this penalty is eliminated by the fact that the data is already stored in a column-by-column fashion, thereby eliminating the need to partition the table yourself. 


A third option is to partition the table horizontally. This is a quite common technique for fact tables that are usually partitioned by time period, but is not used very often for dimension tables. The problem with horizontal partitioning is that a partition key needs to be defined in such a way that the partitioning makes sense. The benefit of horizontal partitioning lies in the fact that when a query is parsed, the optimizer can determine which partitions to use and which not, based on the partition key. Most queries involve a date/time attribute, so when using a time-based partitioning scheme, this is easy. If your data is partitioned by month and the query affects only last month’s sales, the optimizer can select a single partition to retrieve the data from.
With customer data it’s not that simple. You cannot use a date field to partition by because no available date is relevant to your analytical queries. Other attributes that can serve as a candidate partitioning key might be geographic or demographic entities, but choosing one of these entities only allows for one way of slicing the data, which might not be relevant in most cases. Added to this, you have to take into account that developing and maintaining a partitioning schema is not a trivial task. It requires advanced database administration skills to set this up and integrate the partitioning scheme with the data loading processes for the data warehouse. For taming monster dimensions, the combination of vertical partitioning and the use of additional mini-dimensions seems, therefore, the best available alternative. To determine which columns to split off for the mini-dimension(s) you’ll need to look at a couple of things:

  • Similarity—Which columns contain similar information or information that is logically grouped together, such as demographic attributes.
  • Cardinality—How many different values can occur in a single column? Columns of low cardinality (such as gender) are ideal candidates for mini-dimensions.
  • Volatility—How often do values in a column change over time? Names don’t change very often; demographic attributes such as age or income group do.