
Create a custom DBT materialization macro
Open, Needs Triage, Public

Description

DBT comes with a number of model materializations, but our common pipeline run pattern is unfortunately not represented among them. Most of the Data Platform's foundational datasets are produced on well-defined time boundaries: daily datasets contain only data for the date they represent, weekly datasets contain the 7 days' worth of data ending on the dataset date they represent, and likewise for monthly and yearly datasets. The DBT materialization type closest to our usage pattern is the incremental materialization, but it selects time chunks of data starting at the last successful run (whenever that might have been) and ending at the current time, producing results that are up to date but not bounded by uniform time intervals.

Fortunately, DBT supports user-created custom materializations.

Create a custom DBT materialization that supports our hourly, daily, weekly, monthly and yearly usage patterns through user-provided start and end date/time stamps.
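A very rough sketch of what such a materialization might look like, using dbt's documented custom materialization Jinja block. Everything here is a proposal, not shipped code: the name wmf_timebound and the frequency key come from this task, and execution_date is the variable suggested below.

```jinja
{# Hypothetical sketch only -- untested, names are this task's proposals. #}
{% materialization wmf_timebound, default %}

  {%- set frequency = config.require('frequency') -%}   {# hourly|daily|weekly|monthly|yearly #}
  {%- set execution_date = var('execution_date') -%}    {# end of the time interval #}
  {%- set target_relation = this.incorporate(type='table') -%}

  {{ run_hooks(pre_hooks) }}

  {# 'main' is the statement dbt inspects for run results #}
  {% call statement('main') -%}
    create or replace table {{ target_relation }} as
    {{ sql }}   {# the model's compiled SELECT, expected to filter on the interval #}
  {%- endcall %}

  {{ run_hooks(post_hooks) }}

  {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}
```

A real implementation would also need to derive the interval start from frequency and execution_date, and handle the date-column configuration discussed below.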

Done is when these items are provided in the dbt-jobs macro library:

  • A custom DBT materialization named wmf_timebound (NOTE: Any better naming suggestions would be very much appreciated)
  • A required materialization configuration key named frequency that accepts values of [hourly, daily, weekly, monthly, yearly]
  • A required execution_date variable that functions like end_date in our current pipelines.
  • A way of configuring the handling of the date/time column(s) in the query source. This part will probably be tricky since our datasets vary a lot in how they represent dates - various naming conventions, column types, and dates could sometimes even be broken into multiple columns.
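From a model author's perspective, the items above might combine into something like the following. This is purely illustrative: the materialization name, the time_column key (one possible shape for the date-column handling), and the injected filter are all assumptions, not an agreed design.

```sql
-- Hypothetical usage of the proposed materialization; nothing here exists yet.
{{
  config(
    materialized = 'wmf_timebound',
    frequency = 'daily',
    time_column = 'dt'   -- one possible shape for the date-column config
  )
}}

select *
from {{ ref('some_source_model') }}
-- the materialization would inject a filter along the lines of:
--   where dt >= '<interval start>' and dt < '<execution_date>'
```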

Event Timeline

I just found that dbt now has something called microbatch incremental models that might be just what we are looking for.

They say:

Unlike traditional incremental strategies, microbatch enables you to reprocess failed batches, auto-detect parallel batch execution, and eliminate the need to implement complex conditional logic for backfilling.

Maybe we could look at it before writing our custom materialization.

@JMonton-WMF the microbatch incremental strategy looks exactly like dbt's answer to the common batching practice we employ. I've just read some of its docs, and it seems we could easily fit it into our common usage patterns, with the caveat that our models would need the event_time column, which they unfortunately sometimes lack.

Great find, I think we just might not need to write a new materialization at all!
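For reference, a microbatch model is configured roughly like this (dbt 1.9+). The column name dt and the model names are illustrative; check the dbt docs for the exact set of supported batch_size values before committing to this for weekly datasets.

```sql
-- Hedged sketch of a microbatch incremental model; names are illustrative.
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'microbatch',
    event_time = 'dt',        -- column dbt uses to slice batches
    batch_size = 'day',
    begin = '2024-01-01'      -- earliest date dbt will ever backfill
  )
}}

select *
from {{ ref('upstream_events') }}  -- upstream model must also declare event_time
-- dbt adds the per-batch time filter on dt automatically; no manual
-- is_incremental() logic is needed.
```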
