We highly recommend including audit columns in the data warehouse. These
columns enable you to trace back data to its original source, and tell you
when a certain row was inserted or modified and who or which process
executed the operation. Under normal operation, the ETL process takes care
of all modifications to the data in the data warehouse, but sometimes it might
be necessary to manually correct something. When this occurs in a carefully
constructed dimension table, it’s important to have the table record these
modifications automatically. For auditing reasons we recommend using the
following four columns (at least in the dimension tables):
- Insert timestamp
- Insert batch process
- Update timestamp
- Update batch process
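To make this concrete, here is a minimal sketch of a dimension table carrying the four audit columns, using Python's built-in sqlite3 module. The table and column names are illustrative assumptions, not a fixed standard; the point is that every insert stamps the row with a timestamp and batch ID, and every update does the same for the update columns.

```python
import sqlite3

# Illustrative dimension table with the four audit columns.
# Names (dim_customer, insert_batch_id, etc.) are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key     INTEGER PRIMARY KEY,
        customer_name    TEXT,
        -- audit columns
        insert_timestamp TEXT NOT NULL,
        insert_batch_id  INTEGER NOT NULL,
        update_timestamp TEXT,
        update_batch_id  INTEGER
    )
""")

# The ETL process stamps every row it inserts with the batch that did it.
conn.execute(
    "INSERT INTO dim_customer VALUES "
    "(1, 'ACME Corp.', datetime('now'), 42, NULL, NULL)"
)

# A later batch modifies the row and records when, and by which batch.
conn.execute(
    "UPDATE dim_customer SET customer_name = 'ACME Inc.', "
    "update_timestamp = datetime('now'), update_batch_id = 43 "
    "WHERE customer_key = 1"
)

row = conn.execute(
    "SELECT insert_batch_id, update_batch_id FROM dim_customer"
).fetchone()
print(row)  # -> (42, 43)
```

For manual corrections, the same pattern applies: a database trigger or a dedicated "manual fix" batch ID can ensure the update columns are maintained even when the change bypasses the ETL process.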
Usually data in a data warehouse is loaded and updated in batches. For those situations, we also recommend using a separate batch table, which stores information about the start and end time of the batch, number of records processed, the machine the process ran on, the name and version of the ETL tool that processed the data, and the system profile (development, test, and production) used. Pentaho Data Integration enables you to create these log tables on both the job and the transformation level.
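A batch table along these lines can be sketched as follows. This is an assumed layout, not the log-table schema Pentaho Data Integration generates for you; the table name, column names, and the tool version shown are all illustrative.

```python
import sqlite3
import socket
from datetime import datetime, timezone

# Hypothetical batch (log) table: start/end time, record count,
# machine, ETL tool name and version, and system profile.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_batch (
        batch_id          INTEGER PRIMARY KEY,
        start_time        TEXT NOT NULL,
        end_time          TEXT,
        records_processed INTEGER,
        machine           TEXT,
        etl_tool          TEXT,
        tool_version      TEXT,
        system_profile    TEXT  -- development, test, or production
    )
""")

def start_batch(conn, tool, version, profile):
    """Register a new batch run and return its generated batch_id."""
    cur = conn.execute(
        "INSERT INTO etl_batch "
        "(start_time, machine, etl_tool, tool_version, system_profile) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), socket.gethostname(),
         tool, version, profile),
    )
    return cur.lastrowid

def end_batch(conn, batch_id, records):
    """Close the batch: record end time and records processed."""
    conn.execute(
        "UPDATE etl_batch SET end_time = ?, records_processed = ? "
        "WHERE batch_id = ?",
        (datetime.now(timezone.utc).isoformat(), records, batch_id),
    )

# Usage: the generated batch_id is what the audit columns in the
# dimension tables refer to.
batch_id = start_batch(conn, "Pentaho Data Integration", "9.x", "development")
end_batch(conn, batch_id, 15000)
row = conn.execute(
    "SELECT records_processed, system_profile FROM etl_batch"
).fetchone()
print(row)  # -> (15000, 'development')
```

Joining a dimension row's insert or update batch ID to this table then answers, in one query, when the row was loaded, by which tool on which machine, and under which profile.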
The combination of audit columns and a batch ID helps you find and correct data problems in the data warehouse more easily. They can also be used to prove the correctness of data or, even more important, the correct loading of incorrect data from the source systems. In 99 percent of the cases, data problems are due to errors in the source system, not to errors in the ETL process or the data warehouse itself. When faced with the inevitable discussions about the reliability of the data in the data warehouse, audit columns are your safety net.