DBT ships with a number of model materializations, but our common pipeline run pattern is unfortunately not represented among them. Most of the Data Platform's foundational datasets are produced on well-defined time boundaries: daily datasets contain only the data for the date they represent, weekly datasets contain the seven days of data ending on the dataset date, and likewise for monthly and yearly datasets. The DBT materialization closest to our usage pattern is the incremental materialization, but it selects a chunk of data starting at the last successful run (whenever that might have been) and ending at the current time, so the resulting datasets are up to date but are not bounded by uniform time intervals.
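To illustrate the mismatch, here is the shape of a typical incremental model in dbt (the `source()` reference and the `event_time` column are made up for the example; `is_incremental()` and `{{ this }}` are standard dbt):

```sql
-- Typical incremental model: the lower bound of the window is
-- "whatever made it into the table last time", so the interval
-- length varies with scheduling gaps instead of being uniform.
select *
from {{ source('events', 'pageviews') }}  -- hypothetical source
{% if is_incremental() %}
  where event_time > (select max(event_time) from {{ this }})
{% endif %}
```

If a run is skipped or delayed, the next run simply covers a longer span, which is exactly the behavior our fixed-interval datasets cannot tolerate.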
Fortunately, DBT supports user-created custom materializations.
Create a custom DBT materialization that supports our hourly, daily, weekly, monthly and yearly usage patterns through user-provided start and end date/time stamps.
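A sketch of what a model using the new materialization might look like (the config keys and the `--vars` invocation below are illustrative, not a final interface; `dbt run --vars` itself is standard dbt):

```sql
{{ config(
    materialized = 'wmf_timebound',  -- proposed custom materialization
    frequency    = 'daily'           -- one of: hourly, daily, weekly, monthly, yearly
) }}

select ...
```

which would then be invoked with something like `dbt run --vars '{execution_date: "2024-05-01T00:00:00"}'`, with the materialization deriving the matching uniform window from `frequency` and `execution_date`.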
Done means the following items are provided in the dbt-jobs macro library:
- A custom DBT materialization named wmf_timebound (NOTE: Any better naming suggestions would be very much appreciated)
- A required materialization configuration key named frequency that accepts one of the values [hourly, daily, weekly, monthly, yearly]
- A required execution_date variable that functions like end_date does in our current pipelines.
- A way of configuring the handling of the date/time column(s) in the query source. This part will probably be tricky since our datasets vary a lot in how they represent dates - various naming conventions, various column types, and a date can sometimes even be split across multiple columns (e.g. separate year/month/day partition columns).
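The core of the materialization would be deriving the uniform [start, end) window from `execution_date` and `frequency`. A minimal sketch of that arithmetic in Python (the function name is hypothetical, and it assumes execution dates land on period boundaries, e.g. the 1st of the month for monthly runs):

```python
from datetime import datetime, timedelta

def time_window(execution_date: datetime, frequency: str) -> tuple[datetime, datetime]:
    """Return the (start, end) bounds of the uniform interval that
    *ends* at execution_date, mirroring end_date in current pipelines."""
    if frequency == "hourly":
        return execution_date - timedelta(hours=1), execution_date
    if frequency == "daily":
        return execution_date - timedelta(days=1), execution_date
    if frequency == "weekly":
        # seven days of data ending on the dataset date
        return execution_date - timedelta(days=7), execution_date
    if frequency == "monthly":
        # step back one calendar month (assumes day-of-month exists there)
        y, m = execution_date.year, execution_date.month
        y, m = (y - 1, 12) if m == 1 else (y, m - 1)
        return execution_date.replace(year=y, month=m), execution_date
    if frequency == "yearly":
        return execution_date.replace(year=execution_date.year - 1), execution_date
    raise ValueError(f"unknown frequency: {frequency}")
```

The materialization would then translate this window into a WHERE clause over the configured date/time column(s), which is where the column-handling configuration from the last bullet comes in.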