Assume, you have the 150 million record customer table before. Maybe your
customer dimension isn’t that big, but even with 5 million records, maintaining
the dimension table in a limited batch window can be challenging enough.
We’ve shown one of the solutions to make a monster dimension better
manageable and optimize query performance: the mini-dimensions. Still,
when your monster dimension contains a lot of detail columns that will not be
used for analytical purposes anyway, it might be a good idea to split them into
a separate table, also called vertical partitioning. The query speed in a row-based
database like MySQL is not only determined by the number of rows, but also
by the total byte size of each row. When a column-based database such as
Infobright, LucidDB or MonetDB is used, this penalty is eliminated by the
fact that the data is already stored in a column-by-column fashion, thereby
eliminating the need to partition the table yourself.
A third option is to partition the table horizontally. This is a quite common
technique for fact tables that are usually partitioned by time period, but is not
used very often for dimension tables. The problem with horizontal partitioning
is that a partition key needs to be defined in such a way that the partitioning
makes sense. The benefit of horizontal partitioning lies in the fact that when
a query is parsed, the optimizer can determine which partitions to use and
which not, based on the partition key. Most queries involve a date/time
attribute, so when using a time-based partitioning scheme, this is easy. If your
data is partitioned by month and the query affects only last month’s sales, the
optimizer can select a single partition to retrieve the data from.
With customer data it’s not that simple. You cannot use a date field to partition by because no available date is relevant to your analytical queries. Other attributes that can serve as a candidate partitioning key might be geographic or demographic entities, but choosing one of these entities only allows for one way of slicing the data, which might not be relevant in most cases. Added to this, you have to take into account that developing and maintaining a partitioning schema is not a trivial task. It requires advanced database administration skills to set this up and integrate the partitioning scheme with the data loading processes for the data warehouse. For taming monster dimensions, the combination of vertical partitioning and the use of additional mini-dimensions seems, therefore, the best available alternative. To determine which columns to split off for the mini-dimension(s) you’ll need to look at a couple of things:
With customer data it’s not that simple. You cannot use a date field to partition by because no available date is relevant to your analytical queries. Other attributes that can serve as a candidate partitioning key might be geographic or demographic entities, but choosing one of these entities only allows for one way of slicing the data, which might not be relevant in most cases. Added to this, you have to take into account that developing and maintaining a partitioning schema is not a trivial task. It requires advanced database administration skills to set this up and integrate the partitioning scheme with the data loading processes for the data warehouse. For taming monster dimensions, the combination of vertical partitioning and the use of additional mini-dimensions seems, therefore, the best available alternative. To determine which columns to split off for the mini-dimension(s) you’ll need to look at a couple of things:
- Similarity—Which columns contain similar information or information that is logically grouped together, such as demographic attributes.
- Cardinality—How many different values can occur in a single column? Columns of low cardinality (such as gender) are ideal candidates for mini-dimensions.
- Volatility—How often do values in a column change over time? Names don’t change very often; demographic attributes such as age or income group do.