пятница, 19 сентября 2014 г.

Handling Dimension Changes


Dimensional data changes constantly. Customers move, new products are introduced, employees get a salary raise, warehouses get new managers, and so on. One of the main challenges when building a data warehouse is how to adapt to these changes in such a way that all relevant history is captured and all transactions are linked to the correct historical dimension records. Some of these changes occur frequently, others only occasionally. Some changes affect a single record, and some involve complete tables, such as introducing a new customer classification. And some changes are relevant to store in a historically correct way for the organization, while others can simply replace old data. For all of these situations, a different strategy can be applied, and together they form the different ways of handling slowly changing dimensions (SCDs). In his first edition of The Data Warehouse Toolkit, Ralph Kimball introduced three different SCD strategies: overwrite, stack, and add. These strategies are now respectively called SCD type 1, 2, and 3 and will be described shortly. Over the years, many other people from the database modeling community added new types and variations to the original SCD types; the current approach covers the handling of all possible changes that can be applied to source data and how to handle these changes most effectively in a data warehouse.
The term slowly changing dimensions is slightly misleading in the sense that you might get the impression that these types should be applied to the complete dimension table. This is an incorrect assumption which we’d like to address before we go on. Each column in a dimension table should be treated individually. So it is possible that for some columns a type 1 is used, while for others the type 2 approach will be more appropriate. Other strategies or combinations can also be used, but identifying which strategy should be applied to each column is an important part of the design process. 
The story doesn’t begin with type 1, however. There is also a type 0, meaning either ‘‘do nothing’’ or ‘‘replace completely.’’ In both cases, this is not an advisable strategy, unless it concerns static dimension tables. The most prominent example of these are the Date and Time dimensions. It’s highly unlikely that the Year value of the date Jan. 20, 2008 will ever change. What can change, however, is a Fiscal Year classification but in that case values usually get overwritten (type 1) or stored in an additional column (type 3).

Not all dimension attributes will be used for analytical purposes and thus not all are of interest for historical correctness. Which attributes are analytical and which are details differs per industry and organization. In some cases, a phone number is just a customer dimension detail that can be overwritten when a new value is extracted from the source system. In other cases, the phone number is being used for analytical purposes and changes need to be tracked. When we analyze sales, it’s usually sufficient to group by city, not necessarily by street or address. In that case, the column city is an analytical attribute and the column address is a plain detail. 


SCD Type 1: Overwrite
A type 1 slowly changing dimension is the most basic one and doesn’t require any special modeling or additional fields. SCD type 1 columns just get overwritten with new values when they come into the data warehouse. Picture shows what a dimension record looks like before and after applying a type 1 dimension change. In this case, the customer moved to a new address on a certain date. 


Without additional knowledge, you have no way of identifying when this change happened, and even if you insert an extra column with a last modified timestamp you only know that something changed inside the record, not in which column this change took place. A type 1 overwrite approach is used for columns that are of interest for our users (otherwise the columns wouldn’t have been there in the first place), but only the current status of the data is relevant, even when looking at older transactions. Remember that when you overwrite a column like this and run a query, all results will show the content of the column as it is now, not the earlier value. 


SCD Type 2: Add Row
Type 2 is not ‘‘the next step’’ in terms of functionality or complexity but is actually a category of its own. All ETL vendors went out of their way to support type 2 SCDs with wizards, macros, and other add-ins over the last couple of years so nowadays you’ll be hard-pressed to find a tool without this support. Of course, Kettle/PDI also has this functionality in the Dimension Lookup/Update step. But what exactly does it do? SCD type 2 is about history preservation and enables an organization to capture changes to a dimension table to retrieve historical correct data when querying the data warehouse.
 Table 1 shows the same example as in the previous paragraph, but now you can track changes through time by adding a few extra fields. There are multiple ways of modeling this or storing multiple versions; the most basic one is to add only a valid_from timestamp to the dimension record. Omitting the corresponding valid_to timestamp adds extra complexity when trying to retrieve the correct version of the dimension entry, so mark this as a mandatory field for this type of change as well. Two other extra columns can often be found in type 2 supporting tables: a current_record column indicating the current version of the dimension record, and a sequence or version number that is incremented each time a new version of the record is added.


Now you can do lots of interesting things with this data. Suppose Mr. or Mrs. Humphries is a regular customer and orders something every month. What happens in your fact table when these transactions are loaded is that the ETL process looks at the valid customer record for each particular customer at the time of loading. This means that all order fact rows for the customer with ID 22321 (the source system customer number) will store customer_key 1 until May 1, 2008, and use customer_key 2 from that day on until the next change for this customer is applied. The example fact table is displayed in Table2. 

Now when you want to know how much revenue was generated in Toronto in 2008, and you run this query in September, the condition is where customer_city = 'Toronto’. The dimension record with the value 1 for customer_key is the only record satisfying this condition. And because the join is on customer_key, only the first four rows of the fact table are retrieved for this customer. When the condition is where customer_name = 'Humphries’, both customer_id 1 and 2 satisfy the condition and all the fact rows will be returned.


PROCESSING TYPE 2 SCD DATA
An ETL process, whether hand coded or coded with a tool, needs to have several capabilities for handling type 2 dimension changes. First of all, the incoming data must be compared to the data that already exists in the dimension table. The output of this process is the incoming data set with flags added for new rows (I for Insert) to be inserted, existing rows that need to be updated (U for Update), and possibly even deleted rows that don’t exist anymore in the source system (D for Delete). Based on the I, U, or D flags, the process then needs to assign new surrogate keys in case of inserts or updates. Inserts are easy: the new row can be added with the default settings for valid_to and current_record. For an update, additional processing is required: the existing row needs to be detected (that’s why the current_record indicator is so useful), the valid_to timestamps need
to be set to the correct value, and the current_record flag has to be set
to a value of 0, N, or whatever logic you have designed for it. Then a new record with the updated data has to be created, a surrogate key generated, and the appropriate timestamps and current_record flag need to be set. Because deleting dimensional data is out of the question (it probably still is referenced by existing facts) the records with a delete flag get their valid_to timestamp set. In addition, another indicator can be used to mark the record as deleted. A possible solution is to use a value other than 0 or 1 for the current_record flag. 


Of course there is more to capturing history in a data warehouse; this is just the plain version of historical correctness. For instance, think about the following issue: what point in time do we use for the valid_from and valid_to timestamps? Is it the time the change was entered in the source system? The time the data was loaded in the data warehouse? The actual time in the real world when the event took place? Or the time when we were notified of this event? The third option might seem the most appropriate version and the best representation of the actual event, but how do we track this? In some industries, such as the insurance business, all these timestamps need to be stored and the data warehouse should have a full account for the history of history, as it is called, as well.


SCD Type 3: Add Column
The type 3 strategy requires at least one extra column in the dimension table. When the data for a type 3 column changes, the existing value is copied to the extra _old column while the new value is placed in the regular column. Table shows an example of this. 


This might seem an odd way of keeping previous values at first, and in most cases it is. It’s only possible to handle one previous version. Storing additional changes requires an extra column for each version you want to keep. But imagine that your organization structure changes completely, or a completely new product group structure is introduced. In those cases, where all records change at the same time, it makes sense to use an extra column. Handling these massive changes with a type 2 scenario doubles the number of records in your dimension table and in most cases, only the new structure of the data is relevant. The old version is just kept for reference purposes or as a translation table.


SCD Type 4: Mini-Dimensions
Kimball’s dimensional modeling introduces the term mini-dimension. Some sources claim this to be a type 4 scenario; others use the term ‘‘type 4’’ for other purposes. The notions of type 4 and 5 SCDs were introduced in 1998 by Michael Schmitz, a renowned expert in data warehousing and dimensional modeling. We conform to this classification here, in contrast to other sources such as Wikipedia, which uses a different categorization.
Mini-dimensions solve two particular problems with changing dimension tables. One problem occurs when dimension tables get really big, say a customer dimension with 150 million rows (they exist!). The second problem occurs when changes happen very frequently, causing the dimension table to double or triple in size each year. The trick here is to first identify which analytical attributes change very frequently and put these as a group in one or more separate dimension tables. The result of this is one or more extra dimension keys in the fact table. 

Here’s an example. Suppose you have a large customer dimension table with the attributes city, region, country, gender, birth_date, and income. The first three fields can be categorized as geographic data; the latter three have a more demographic nature. Of course, the gender description doesn’t change very often but is probably one of the key attributes for analytical purposes. These six fields can therefore be put in two different mini dimensions, a dim_geography and a dim_demography.
Mini-dimensions do honor to their name: They usually are very small, not only in number of records but also in number of attributes, as in this example. There is, however, a tradeoff involved in order to keep this number of records as small as possible. When an attribute such as income is used in a mini-dimension, it’s impossible to store all possible different values so you need to work with banded values or ranges. Remember that when using fields A, B, and C for a mini-dimension, the number of records is determined by multiplying the number of possible values of A, B, and C. So if each age from 0 to 100 is used, the multiplier is already 101. It makes more sense to define age and income ranges, maybe 10 of each, resulting in 100 records. Multiplied by the three possible values for gender (male, female, and unknown) the number of rows in the mini-dimension will be 300, which is really small by any standard. Mini-dimensions with 100,000 rows are not uncommon and with the current state of technology also not a problem. If the mini-dimensions get any larger than that, it’s advisable to redesign the dimension table and maybe split it up again into smaller dimensions.
Table shows an example data model with a fact table and three dimension tables of which two are mini-dimensions.
In order to make this work you need to identify the correct mini-dimension key when loading the facts, which requires extra overhead. For instance, to determine the age group a customer belongs to, you need to calculate the age based on the load date and the customer’s birth date. The same goes for income band and all other attributes needed to determine the correct mini-dimension key. The payoff, however, is huge. Suddenly, many fewer changes can occur to your ‘‘main’’ dimension table. History is perfectly covered in the fact tables using the mini-dimension foreign keys, and to top it off, you also added the current mini-dimension key to the main dimension table. This last addition enables the user to use the mini-dimensions in conjunction with the main dimension without the need to query the fact table. In fact, this model serves two purposes: It gives the current dimension record value, which is very useful for selecting target groups—for example, for direct marketing campaigns—while retaining the full history of the analytical attributes in the fact table. 


SCD Type 5: Separate History Table
So far, the dimension changes have affected the way you look at your query results with respect to the corresponding fact rows. The different strategies enable you to identify which version of a dimension record was valid at the time of a transaction, be it a sale, an inventory movement, or some other business event. Type 5 is a little bit different because it cannot be used to run analytical queries that use the fact table as well. With type 5, a separate history table is created for a dimension table with the sole purpose of correctly capturing all changes to all attributes in the dimension table. A type 5 strategy is therefore an addition to the existing SCD types and should be used in conjunction with one or a combination of the other strategies. Type 5 history tables should therefore not be used for analytical queries involving fact tables. Table shows an example of this type of table together with the parent dimension table. 


This almost looks like a fact table, and indeed it has some of the same characteristics as a fact table. However, there are no real facts: There’s nothing to summarize because there are no measurable items. The only available option is to count the number of customers or changes, but that could be an interesting enough activity: How many times do customers get a new e-mail address? What’s the average moving rate?
Also note that the diagram in Figure 7-10 contains a change type table that serves as a lookup table that indicates exactly what values have changed. Next table shows the first few rows of this table so that you can get a feeling of what this data is all about.
Change types are also a perfect indicator for data quality issues. Look, for instance, at type number 4 where only the city name is changed. It’s highly unlikely that this happens without the change of address, postal code, and possibly telephone number as well. Regardless of whether you overwrite existing values in your customer dimension, changes should be tracked in the history table. The end effect is a full audit trail of changes to dimension data. Remember, however, that this serves a different purpose than the regular dimension and fact tables used for reporting and analysis.
The example described previously is only one of the many possible solutions for modeling a full data audit trail. Another common approach is to mimic the change or log tables used by many database management systems for replicating and synchronizing data between two or more databases. Besides the changed data, these change tables contain the record key, change timestamp, and the change type indicator I, U, or D for Insert, Update, and Delete. Sometimes before and after images are both present in these tables, comparable to a type 3 SCD scenario. 

SCD Type 6: Hybrid Strategies 

Type 6 doesn’t really exist but is sometimes described as 1+2+3=6, indicating that it is a mix of different strategies applied to a single dimension table. Kimball gives an example of this in The Data Warehouse Lifecycle Toolkit. A common business request is ‘‘I want to see all product revenue according to the current version of the product group.’’ This query cannot be answered if a product group is handled as a type 2 attribute, which is usually the case. One way to solve this is to add an extra product attribute to the dimension table where the current version of the product group is stored for all versions of the product. Table above illustrates this feature.  By using this modeling technique, you can still store the correct version of the product in your fact table for historical correctness, but this technique also enables a user to look at this information as if all history were based on the current product group classification. This is, however, not a very flexible solution and works only for one or a few attributes. Another way of solving this problem is joining the same dimension table on the source ID, and restricting the extra joined table on the current record. By doing this, all attributes can be used ‘‘as is’’ while at the same time retrieving the full history from the fact table based on the SCD keys. The last option is to add an additional dimension table with only the current values in it. This current dimension table will then get its own key in the fact table. A drawback of this solution is that you need an extra key column in the fact table for each current dimension you add this way.