
Create example dbt models using Iceberg
Closed, Resolved · Public

Description

We need to create some example models with dbt that can read from our Hive and Iceberg tables and write in new Iceberg tables.

These models should include:

  • A partition definition
  • A merge strategy
  • An incremental condition, so they don't overwrite the whole table on each run.
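
A minimal sketch of such a model with dbt-spark, covering all three points. The source table `event.some_events`, the columns, and the daily `dt` partition are illustrative assumptions, not the actual tables in the MR:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='event_id',
        file_format='iceberg',
        partition_by=['dt']
    )
}}

select
    event_id,
    user_id,
    dt
from {{ source('event', 'some_events') }}

{% if is_incremental() %}
  -- incremental condition: only process new partitions
  -- instead of rewriting the whole table on each run
  where dt > (select max(dt) from {{ this }})
{% endif %}
```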

Event Timeline

Created an MR with a couple of examples of Iceberg tables.

https://gitlab.wikimedia.org/repos/data-engineering/dbt-jobs/-/merge_requests/3

The code is working on stat machines:

15:03:21  1 of 2 OK created sql incremental model javiermonton.iceberg_insert_overwrite .. [OK in 17.10s]
15:03:21  2 of 2 START sql incremental model javiermonton.iceberg_merge .................. [RUN]
15:03:36  2 of 2 OK created sql incremental model javiermonton.iceberg_merge ............. [OK in 15.36s]
15:03:36  
15:03:36  Finished running 2 incremental models in 0 hours 0 minutes and 44.08 seconds (44.08s).
15:03:36  
15:03:36  Completed successfully
15:03:36  
15:03:36  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=2
Ottomata renamed this task from Create example models using Iceberg to Create example dbt models using Iceberg.Oct 29 2025, 3:15 PM

One thing I wish to highlight: for the moment, on most of our Iceberg tables we don't use a merge strategy for our incremental runs, but rather a delete + insert, which makes us more idempotent when jobs are rerun (see below).
I suggest we spend some time on a spike to check whether the existing strategy can be replicated with dbt, and think carefully about the impact if we need to change it :)

Details on job idempotency: when re-running a job for a certain timeframe, we want to be sure that the existing data for that timeframe is entirely replaced by the new data, not merely updated. We need to verify that this can be achieved with an Iceberg merge without too much complexity :).
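
For reference, the delete + insert pattern described above can be sketched in Spark SQL on Iceberg (table, column, and date values are illustrative):

```sql
-- Rerunning this pair of statements for the same timeframe is idempotent:
-- all existing rows for the interval are dropped, then fresh rows inserted,
-- so stale rows from a previous (partial) run cannot survive.
DELETE FROM wmf.some_metrics
WHERE dt >= '2023-05-01' AND dt < '2023-05-02';

INSERT INTO wmf.some_metrics
SELECT event_id, user_id, dt
FROM event.some_events
WHERE dt >= '2023-05-01' AND dt < '2023-05-02';
```

A plain merge with a `unique_key`, by contrast, updates or inserts matching rows but does not by itself delete rows that exist in the target yet are absent from the new data, which is the behavior we would need to verify.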

Then we could discard the merge strategy and use insert_overwrite, which seems to do what you are describing. If we are not completely confident, I guess we can try to debug an example and check exactly which queries dbt generates.

insert_overwrite is what @JAllemandou is describing, perfect.

To my understanding, the insert_overwrite strategy only works for Iceberg tables when the table's time partitioning matches the loading job's cadence. This is because insert_overwrite on Iceberg overwrites entire partitions as soon as any data is written to them.
The related concern is that we explicitly do NOT want Iceberg time partitioning to match the loading job frequency, for instance to mitigate small-file issues. So in those cases (the unique_devices tables being one of them), we need a different strategy, and the question I had is whether merge can safely replace delete + insert. If it can, great, let's switch to it. Otherwise we need to make dbt replicate our delete + insert approach.
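
To make the partition-grain constraint concrete, here is a hedged sketch of an insert_overwrite model in dbt-spark (source, columns, and the `run_dt` variable are illustrative assumptions):

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        file_format='iceberg',
        partition_by=['dt']
    )
}}

-- Every dt partition touched by this select is fully replaced on write,
-- which is why the partition grain must match the job's cadence: a daily
-- job writing into a monthly partition would wipe the rest of the month.
select event_id, user_id, dt
from {{ source('event', 'some_events') }}
{% if is_incremental() %}
  where dt = '{{ var("run_dt") }}'  -- illustrative run-date variable
{% endif %}
```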

That specific use-case sounds like what dbt calls a microbatch incremental strategy, which replaces time intervals based on the event_time column: https://docs.getdbt.com/docs/build/incremental-microbatch
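
As a sketch of that configuration (microbatch requires a recent dbt version, and dbt injects the per-batch time filter itself when the upstream model also declares an event_time; model and column names are illustrative):

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='dt',        -- column dbt filters each batch on
        batch_size='day',       -- one batch per day
        begin='2023-01-01',     -- illustrative start date for the initial load
        file_format='iceberg',
        partition_by=['dt']
    )
}}

-- Each run replaces the day-sized interval(s) it covers, and a backfill
-- is split into one batch per day rather than one big statement.
select event_id, user_id, dt
from {{ ref('some_upstream_model') }}
```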

I don't think the microbatch strategy changes anything for our case. It helps when reprocessing large amounts of time-partitioned data with dbt alone: dbt splits the job into smaller batches, in a similar way to what we would do with Airflow.