techBI: Using Surrogate Keys

Each table in a database usually has a primary key, which is the unique identifier of a record. This key can consist of one or more columns, and any data type can be used for these columns. The databases used as a source system for a data warehouse often contain primary keys consisting of multiple columns. These primary keys come in all sorts and are either database-generated or user-supplied. The term used to refer to these source system keys is natural key, as opposed to the term artificial or surrogate key used in a data warehouse. Natural keys often contain information about the nature of the record they’re referring to. A product key might therefore consist of multiple parts indicating things like department, model number revision number, and product type. When a key comprises a combination of such parts and the key alone is enough to reveal to a user what the data is about, it is also referred to as a smart key.

From a data warehouse point of view, there’s nothing smart about a smart key; they take up unnecessary space and are hard to build and maintain indexes on. In a data warehouse, surrogate keys should be used, which is perhaps the most important design principle when building a data mart using star schemas. A surrogate key is a database-generated identifier without any inherent meaning. Its sole purpose is to uniquely identify a dimension record using the smallest possible data type. The primary key of a dimension table therefore always consists of a single column. This is important because fact records can usually be identified by the combination of the primary keys of the dimension tables.
When you look at the diagram in Figure 7-1, you will see that each sales fact (an individual sales transaction) has five foreign keys, to time, product, customer, store, and promotion. Now suppose that for ‘‘time’’ you use a datetime data type and for all the other keys their original source system primary key. Suppose also that these other keys are ‘‘smart’’ and occupy 15 characters, which translates to 15 bytes. You end up with a 68 byte key for the fact table (4 × 15 plus 8 bytes for the datetime). With 100 million fact rows, you need approximately 6.5 gigabytes to store this information. Using surrogate keys, this can be slimmed down to 20 bytes per key (5 integers) resulting in 1.9 GB. That’s 4.6 GB less disk space to occupy and also (which is more important) less disk I/O when a query is executed.
Surrogate keys have other advantages as well:

There’s always only a single column key for each dimension table so the resulting primary key index will be smaller.
Integer indexes are usually a lot faster than character or datetime indexes.
They enable the storage of multiple versions of an item where the item retains its original source key but is allotted a new surrogate key.

They allow for dealing with optional relations, unknown values and irrelevant data, so therefore you can avoid using outer joins in your queries.

Surrogate keys can be generated in two ways: by using database functionality (auto-increment values or sequences) or by using the ETL tool to generate next key values. We prefer the latter because some ETL tools need special configuration for handling database-generated keys.

пятница, 19 сентября 2014 г.

Using Surrogate Keys

пятница, 19 сентября 2014 г.