When working with dbt, different materializations are available. The simplest are table and view, but the table materialization rebuilds the whole table on each execution, so most of our pipelines will use the incremental materialization.
The incremental materialization aims to insert only new or changed data on each execution, without rewriting the whole table. It can do this in different ways, depending on the configuration and on the adapter used.
Incremental models have a config called incremental_strategy. When working with Spark, the default strategy is append; insert_overwrite and merge can be used too.
The append strategy inserts new rows on each execution. If the incremental condition ({% if is_incremental() %} ... {% endif %}) is correctly set up (for example, filtering on select max(updated_date) from {{ this }}), it can work well, but it won't reprocess records that arrive late or out of order at the source.
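As a minimal sketch of the append strategy (the source model my_source_table and the updated_date column are assumptions, not names from our project):

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='append'
    )
}}

select *
from {{ ref('my_source_table') }}  -- hypothetical source model

{% if is_incremental() %}
  -- only pick up rows newer than what is already loaded;
  -- late-arriving rows with an older updated_date are silently skipped
  where updated_date > (select max(updated_date) from {{ this }})
{% endif %}
```

On the first run (or with --full-refresh) the filter is skipped and the whole source is loaded; on subsequent runs only the new rows are appended.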
The insert_overwrite strategy uses insert overwrite table ... partition (...) statements, replacing whole partitions instead of appending rows, but it can cause problems on our Iceberg tables if the source and destination partitions are not based on the same dates.
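A sketch of how insert_overwrite would be configured (the partition column and the 3-day lookback window are assumptions for illustration):

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by=['updated_date']  -- assumed partition column
    )
}}

select *
from {{ ref('my_source_table') }}  -- hypothetical source model

{% if is_incremental() %}
  -- every partition present in this result set is fully replaced
  -- in the target; partitions with no new data are left untouched
  where updated_date >= date_sub(current_date(), 3)
{% endif %}
```

This is why the source/destination partitioning must line up: if the select returns only part of a partition's rows, the rest of that partition is lost on overwrite.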
The merge strategy uses merge into ... statements and requires a unique_key to be defined, so it can update existing records instead of duplicating them, but it can also cause other problems on our Iceberg tables.
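A sketch of a merge configuration (the unique key id and the file_format value are assumptions; merge requires a table format that supports it, such as Iceberg or Delta):

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='id',           -- assumed business key
        file_format='iceberg'      -- assumed; merge needs a format that supports it
    )
}}

select *
from {{ ref('my_source_table') }}  -- hypothetical source model

{% if is_incremental() %}
  where updated_date > (select max(updated_date) from {{ this }})
{% endif %}
```

Rows whose id matches an existing row in the target are updated; all other rows are inserted, so late-arriving corrections to existing records are handled.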
We should explore and test the different strategies to define a guide on how and when to use them properly.
The task is done if:
- We have a document / README explaining how and when to use each strategy, with some examples.