In this blog, we will share the challenges that we faced with the default behavior of the dbt snapshot feature and the solution that has been implemented to overcome the same by customizing the snapshot macros.
The requirement is to snapshot the data present in the Employee table into the target table named Employee_History, with below-required changes to be done in the snapshot model.
1.The columns dbt_updated_at and dbt_scd_id which dbt generates by default should not be present in the snapshot table.
2.The columns dbt_valid_to and dbt_valid_from which dbt generates by default should be renamed to valid_to and valid_from and their data type should be DATE instead of the default data type TIMESTAMP.
3. Whenever a new record is inserted, the column valid_to has to be populated as ‘9999-12-31’, instead of the default value NULL.
4. As far as an update in the snapshot table is concerned, even if the source table has n number of columns, the snapshot table should be updated if there is a valid update in the columns selected in the snapshot model.
5. While snapshotting the data, the columns (like sequence, timestamp) whose values are generated in run time should not be considered. But the source data should be checked against target data for updates.
The above conditions can be satisfied by overriding the default behavior of the dbt – snapshot materialization and its related macros.
There are two strategies offered by dbt to snapshot the data, namely, Timestamp and Check. In this case, we cannot proceed with the Timestamp strategy since our fourth requirement requires us to check for updates column by column. So, we are overriding macros related to the check_cols strategy.
Macros to be customized in dbt: The snapshot materialization, along with the following macros must be to be overridden to implement this change.
For example, to meet the requirements 1 and 2, we must modify the following macro to suit our needs.
Before change: The macro build_snapshot_table builds a SQL query which in turn is used for the creation of the snapshot table. We see that the columns dbt_scd_id, dbt_updated_at, dbt_valid_from and dbt_valid_to are added by dbt when a new snapshot table is created.
After change: In the modified macro, the columns dbt_scd_id and dbt_updated_at are removed from the query. The columns dbt_valid_from and dbt_valid_to are renamed to valid_from and valid_to respectively and are converted to DATE.
Demonstration(requirements 1, 2 and 3)
Let us consider that we have a source table Employee as seen below.
After the snapshot model is run for the first time, the target table Employee_History contains the following records selected from the Employee table. We could see that the requirements 1, 2 and 3 are met here.
Demonstration(requirements 4 and 5)
Let us make some changes in the Source table. The records that are either inserted or updated in the source table are highlighted in yellow.
After the snapshot model is run for the second time, the target table Employee_History is as seen below.
Even though there is a change in the column CITY for two records(EMPID=1 and EMPID=2), they did not make it to the snapshot table since the column CITY is not selected in the snapshot model. Hence, we have met requirement 4.
Here, the values for the columns, employee_history_sk and loaded_at are generated at run time. The macro is customized so that the columns loaded_at (a timestamp column) and employee_history_sk (a surrogate key generated by sequence) are ignored while checking for updates in the source table. Hence, we have met requirement 5 as well.
We sincerely hope that this blog helped you in understanding how to override macros to suit our requirements. You can download the code for default dbt macros and materialization for your customization. Please click the link to explore more blogs on Snowflake.