Friday, September 19, 2014

Audit Columns


It is highly recommended to include audit columns in the data warehouse. These columns enable you to trace data back to its original source and tell you when a certain row was inserted or modified and who or which process executed the operation. Under normal operation, the ETL process takes care of all modifications to the data in the data warehouse, but sometimes it might be necessary to manually correct something. When this occurs in a carefully constructed dimension table, it’s important to have the table record these modifications automatically. For auditing purposes we recommend using the following four columns (at least in the dimension tables):

  • Insert timestamp 
  • Insert batch process 
  • Update timestamp 
  • Update batch process
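A minimal sketch of how these four columns might look on a dimension table, here using SQLite via Python. The table and column names (`dim_customer`, `insert_ts`, and so on) are illustrative assumptions, not a fixed naming standard:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")

# Hypothetical dimension table carrying the four audit columns.
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key    INTEGER PRIMARY KEY,
        customer_name   TEXT NOT NULL,
        insert_ts       TEXT NOT NULL,    -- insert timestamp
        insert_batch    INTEGER NOT NULL, -- batch process that inserted the row
        update_ts       TEXT,             -- update timestamp (NULL until first update)
        update_batch    INTEGER           -- batch process that last updated the row
    )
""")

now = datetime.now(timezone.utc).isoformat()
batch_id = 42  # assumed to come from a separate batch table

# The ETL process stamps insert_ts / insert_batch on every new row...
conn.execute(
    "INSERT INTO dim_customer (customer_key, customer_name, insert_ts, insert_batch) "
    "VALUES (?, ?, ?, ?)",
    (1, "Acme Corp", now, batch_id),
)

# ...and update_ts / update_batch on every change to an existing row.
conn.execute(
    "UPDATE dim_customer SET customer_name = ?, update_ts = ?, update_batch = ? "
    "WHERE customer_key = ?",
    ("Acme Corporation", now, batch_id + 1, 1),
)

row = conn.execute(
    "SELECT customer_name, insert_batch, update_batch FROM dim_customer"
).fetchone()
print(row)  # shows which batch inserted the row and which batch last touched it
```

With this in place, any row can be traced to the two batch runs that created and last modified it, even after a manual correction.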

Usually data in a data warehouse is loaded and updated in batches. For those situations, we also recommend using a separate batch table, which stores information about the start and end time of the batch, the number of records processed, the machine the process ran on, the name and version of the ETL tool that processed the data, and the system profile (development, test, or production) used. Pentaho Data Integration enables you to create these log tables at both the job and the transformation level.
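A batch table along those lines might be sketched as follows, again in SQLite via Python. The layout is an illustrative assumption; Pentaho Data Integration generates its own job and transformation log table layouts:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")

# Hypothetical batch log table: one row per ETL batch run.
conn.execute("""
    CREATE TABLE etl_batch (
        batch_id        INTEGER PRIMARY KEY,
        start_ts        TEXT NOT NULL,
        end_ts          TEXT,
        records_read    INTEGER,
        records_written INTEGER,
        machine         TEXT,  -- host the process ran on
        etl_tool        TEXT,  -- name and version of the ETL tool
        profile         TEXT   -- development, test, or production
    )
""")

# Register the batch at the start of the run...
conn.execute(
    "INSERT INTO etl_batch (batch_id, start_ts, machine, etl_tool, profile) "
    "VALUES (?, ?, ?, ?, ?)",
    (42, datetime.now(timezone.utc).isoformat(),
     "etl-server-01", "Pentaho Data Integration 5.1", "production"),
)

# ...and close it when the run finishes, recording the row counts.
conn.execute(
    "UPDATE etl_batch SET end_ts = ?, records_read = ?, records_written = ? "
    "WHERE batch_id = ?",
    (datetime.now(timezone.utc).isoformat(), 1500, 1500, 42),
)

batch = conn.execute(
    "SELECT records_written, profile FROM etl_batch WHERE batch_id = 42"
).fetchone()
print(batch)
```

The `batch_id` stored here is the same value the ETL process stamps into the audit columns of each dimension row, which is what ties individual rows back to a specific run.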
The combination of audit columns and a batch ID helps you find and correct data problems in the data warehouse more easily. They can also be used to prove the correctness of data or, even more importantly, the correct loading of incorrect data from the source systems. In 99 percent of cases, data problems are due to errors in the source system, not to errors in the ETL process or the data warehouse itself. When faced with the inevitable discussions about the reliability of the data in the data warehouse, audit columns are your safety net.