wmf_content.mediawiki_content_history_v1 serves as a replacement for wmf.mediawiki_wikitext_history.
In this task we should PoC what a replacement for wmf.mediawiki_wikitext_current looks like.
We hypothesize that it should be easy and cheap to use Iceberg's CDC-like interface to consume changes between wmf_content.mediawiki_content_history_v1's snapshots.
So we should:
- Create a new wmf_content.mediawiki_content_current_v1 table.
- Backfill wmf_content.mediawiki_content_current_v1 with the current revisions.
- PoC a job that consumes the CDC-like interface (namely create_changelog_view()) and merges the changes into the new wmf_content.mediawiki_content_current_v1 (see the sketch after this list).
- Make sure we document the steps for such a consumption clearly, so that it serves as an example for future users of this interface.
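For reference, here is a minimal sketch of what such a job could look like on Spark with Iceberg's SQL extensions. The column names (wiki_id, page_id, revision_id, content) and the snapshot-id placeholders are assumptions for illustration only, not the actual schema:

```sql
-- Hypothetical sketch: consume changes between two snapshots of the history
-- table and merge the newest row per page into the current table.

-- 1) Materialize a changelog view between two snapshots of the source table.
CALL spark_catalog.system.create_changelog_view(
  table => 'wmf_content.mediawiki_content_history_v1',
  options => map(
    'start-snapshot-id', '<last_processed_snapshot_id>',
    'end-snapshot-id',   '<latest_snapshot_id>'
  ),
  changelog_view => 'content_history_changes'
);

-- 2) Keep only inserted/updated rows, pick the newest revision per page
--    (this is where the window functions come in), and merge.
MERGE INTO wmf_content.mediawiki_content_current_v1 t
USING (
  SELECT wiki_id, page_id, revision_id, content
  FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (
             PARTITION BY wiki_id, page_id
             ORDER BY revision_id DESC, _change_ordinal DESC
           ) AS rn
    FROM content_history_changes c
    WHERE _change_type IN ('INSERT', 'UPDATE_AFTER')
  ) ranked
  WHERE rn = 1
) s
ON t.wiki_id = s.wiki_id AND t.page_id = s.page_id
WHEN MATCHED AND s.revision_id > t.revision_id THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```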
Conclusion:

TL;DR: Although interesting, Iceberg's create_changelog_view() is not currently the best mechanism for our use case. We will instead use traditional SQL to calculate the changes between the history table and the current table via an (optimized) full table scan.
Longer:
I think the best way to move forward here, to populate the new wmf_content.mediawiki_content_current_v1 table, is to do a full table scan of wmf_content.mediawiki_content_history_v1 via regular SQL to calculate the changes and apply them via a MERGE INTO, instead of using the create_changelog_view() mechanism (see the sketch below).
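As a sketch (column names are again assumptions; the real job is in the attached notebook), the change calculation and its application collapse into a single statement:

```sql
-- Hypothetical sketch: recompute the newest revision per page from the full
-- history table and reconcile the current table against it in one MERGE.
MERGE INTO wmf_content.mediawiki_content_current_v1 t
USING (
  SELECT wiki_id, page_id, revision_id, content
  FROM (
    SELECT h.*,
           ROW_NUMBER() OVER (
             PARTITION BY wiki_id, page_id
             ORDER BY revision_id DESC
           ) AS rn
    FROM wmf_content.mediawiki_content_history_v1 h
  ) ranked
  WHERE rn = 1
) s
ON t.wiki_id = s.wiki_id AND t.page_id = s.page_id
-- Only touch rows that actually changed, so unchanged pages are not rewritten.
WHEN MATCHED AND s.revision_id > t.revision_id THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Optionally (Spark 3.4+), pages no longer in the history table could be
-- dropped with: WHEN NOT MATCHED BY SOURCE THEN DELETE
```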
The reasons are as follows:
- The performance of doing it via SQL is better (change calculation takes ~10 mins vs ~54 mins with 64 2-core executors), and I think the SQL logic is significantly easier to follow than the procedure plus the window functions we would need to achieve our semantics. Even if the performance were the other way around, I think I would prefer readability in this case.
- create_changelog_view() requires us to keep state; otherwise, a failed run for any particular day means we will keep missing updates until we fix the daily run. The cookbook example (https://www.tabular.io/apache-iceberg-cookbook/data-engineering-incremental-processing/) suggests persisting snapshot metadata of the source table on the target table (see the bookkeeping sketch after this list). None of this is needed if we do it via SQL and a full table scan, as we will automatically apply detected changes, and in the event of a failure, a future run will reconcile automatically. This further simplifies the logic.
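For illustration, this is roughly the per-run bookkeeping the changelog approach would force on us. The property key and placeholders are made up for the example; the cookbook actually persists the watermark in the target table's snapshot summary, while a table property is shown here for simplicity:

```sql
-- Recover the watermark left by the previous successful run.
SHOW TBLPROPERTIES wmf_content.mediawiki_content_current_v1 ('last-processed-snapshot-id');

-- Pick the source table's latest snapshot as the end of the changelog window.
SELECT snapshot_id
FROM wmf_content.mediawiki_content_history_v1.snapshots
ORDER BY committed_at DESC
LIMIT 1;

-- ... run create_changelog_view() + MERGE INTO between those two snapshots ...

-- Persist the new watermark. If a run fails before this point, every later
-- run must still start from the old watermark, or updates are silently lost.
ALTER TABLE wmf_content.mediawiki_content_current_v1
SET TBLPROPERTIES ('last-processed-snapshot-id' = '<end_snapshot_id>');
```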
In the future, if we want to produce, say, hourly runs, we may want to revisit this decision.
I am making my notebook with my tests available as an attachment to this phab ticket.