
Spike: Figure how best to produce wmf_content.mediawiki_content_current_v1
Closed, ResolvedPublic

Description

wmf_content.mediawiki_content_history_v1 serves as a replacement for wmf.mediawiki_wikitext_history.

In this task we should PoC what a replacement for wmf.mediawiki_wikitext_current looks like.

We hypothesize that it should be easy and cheap to use Iceberg's CDC-like interface to consume changes between wmf_content.mediawiki_content_history_v1's snapshots.

So we should:

  • Create a new wmf_content.mediawiki_content_current_v1 table.
  • Backfill wmf_content.mediawiki_content_current_v1 with the current revisions.
  • PoC a job that consumes the CDC-like interface (namely create_changelog_view()) and merges into the new wmf_content.mediawiki_content_current_v1.
  • Make sure we document the steps for such a consumption clearly, so that it serves as an example for future users of this interface.
  • More generally, determine what the replacement for wmf.mediawiki_wikitext_current should be.

Although not planned yet, a new version of wmf.mediawiki_wikitext_current could also be produced as a table downstream from wmf_dumps.wikitext_raw. We can, again, just consume wmf_dumps.wikitext_raw in a CDC fashion and SQL MERGE the changes into the new table. This would be a cheap alternative to the current wmf.mediawiki_wikitext_current job, which recalculates everything from scratch every month.
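
For concreteness, here is a minimal, hedged sketch of the CDC-consume-and-merge pattern described above, against the tables named in this task. The snapshot IDs, the view name, and the column names (wiki_db, page_id, revision_id, revision_dt, revision_content_slots) are illustrative assumptions, not the actual schema:

```sql
-- Sketch only: build a changelog view over the history table between two
-- snapshots, then merge the resulting changes into the current table.
CALL spark_catalog.system.create_changelog_view(
  table          => 'wmf_content.mediawiki_content_history_v1',
  options        => map('start-snapshot-id', '<last consumed snapshot id>',
                        'end-snapshot-id',   '<latest snapshot id>'),
  changelog_view => 'content_history_changes'
);

MERGE INTO wmf_content.mediawiki_content_current_v1 AS current
USING (
  -- Keep only new row images; column list is an assumption about the schema.
  SELECT wiki_db, page_id, revision_id, revision_dt, revision_content_slots
  FROM content_history_changes
  WHERE _change_type IN ('INSERT', 'UPDATE_AFTER')
) AS changes
ON  current.wiki_db = changes.wiki_db
AND current.page_id = changes.page_id
WHEN MATCHED AND changes.revision_id > current.revision_id THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```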


Conclusion: TL;DR: Although interesting, Iceberg's create_changelog_view() is not currently the best mechanism for our use case. We will instead use traditional SQL to calculate changes on the history table compared to the current table via an (optimized) full table scan.

Longer:
I think the best way forward to populate a new wmf_content.mediawiki_content_current_v1 table is to do a full table scan of wmf_content.mediawiki_content_history_v1 with regular SQL to calculate the changes and apply them via MERGE INTO, instead of using the create_changelog_view() mechanism.
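
As a rough illustration of that approach, here is a minimal Spark SQL sketch. The column names and the ROW_NUMBER() dedup are illustrative assumptions about how "latest revision per page" could be computed; the actual job may use a different (optimized) formulation:

```sql
-- Sketch only: pick the latest revision per (wiki_db, page_id) from the
-- history table via a full scan, then merge it into the current table.
MERGE INTO wmf_content.mediawiki_content_current_v1 AS current
USING (
  SELECT wiki_db, page_id, revision_id, revision_dt, revision_content_slots
  FROM (
    SELECT wiki_db, page_id, revision_id, revision_dt, revision_content_slots,
           ROW_NUMBER() OVER (
             PARTITION BY wiki_db, page_id
             ORDER BY revision_dt DESC, revision_id DESC
           ) AS rn
    FROM wmf_content.mediawiki_content_history_v1
  )
  WHERE rn = 1
) AS latest
ON  current.wiki_db = latest.wiki_db
AND current.page_id = latest.page_id
-- Only rewrite rows whose current revision actually changed; insert new pages.
WHEN MATCHED AND latest.revision_id != current.revision_id THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```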

The reasons are as follows:

  • The performance of doing it via SQL is better (change calculation takes ~10 mins vs ~54 mins with 64 2-core executors), and I think the SQL logic is significantly easier to follow than the procedure plus the window functions we need to achieve our semantics. Even if the performance were the other way around, I would still prefer readability in this case.
  • create_changelog_view() requires us to keep state; otherwise, a failed run on any particular day means we will keep missing updates until we fix the daily run. The cookbook example (https://www.tabular.io/apache-iceberg-cookbook/data-engineering-incremental-processing/) suggests persisting snapshot metadata of the source table on the target table. None of this is needed if we do it via SQL and a full table scan: detected changes are applied automatically, and in the event of a failure, a future run will reconcile automatically. This further simplifies the logic.

In the future, if we want to produce, say, hourly runs, we may want to revisit this decision.

I am making my notebook with my tests available as an attachment to this phab ticket.

Event Timeline

xcollazo added a subscriber: Ottomata.

CC @Ottomata, thought you might find this mechanism interesting.

Oh my yes please. Are we planning on doing this? If so, let's evaluate the various Iceberg + CDC tools out there.

Are we planning on doing this?

It was pointed out on T358366 that it would be nice to have a replacement for wikimedia_wikitext_current as well.

If so, let's evaluate the various Iceberg + CDC tools out there.

This particular task is to PoC Iceberg's create_changelog_view() mechanism, which is built into the Spark Iceberg jar and is only meant to produce CDC-like rows from an Iceberg table after a write. So we do a regular MERGE INTO on wmf_dumps.wikitext_raw, and then use this mechanism, which creates a temporary VIEW over those changes, to be ingested downstream.
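
A minimal sketch of how that mechanism is invoked, assuming placeholder snapshot IDs and view name (the changelog metadata columns shown are the ones Iceberg's Spark procedure documents; everything else is illustrative):

```sql
-- Sketch only: expose the CDC-like rows produced by writes to the table
-- between two snapshots as a temporary view.
CALL spark_catalog.system.create_changelog_view(
  table          => 'wmf_dumps.wikitext_raw',
  options        => map('start-snapshot-id', '<last consumed snapshot id>',
                        'end-snapshot-id',   '<latest snapshot id>'),
  changelog_view => 'wikitext_raw_changes'
);

-- The view contains the table's columns plus changelog metadata columns such as
-- _change_type ('INSERT', 'DELETE', 'UPDATE_BEFORE', 'UPDATE_AFTER').
SELECT _change_type, _change_ordinal, _commit_snapshot_id
FROM wikitext_raw_changes
LIMIT 10;
```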

AH! I see! Sorry I thought this was about CDC ingestion of data into Iceberg. This is about CDC of Iceberg tables themselves. Very interesting. Proceed!

Moving this to Phase III. We don't have the time to do it on Phase II.

Noting here that there are multiple immediate use cases for having a wmf_content.mediawiki_content_current_v1 table:

xcollazo renamed this task from Use the Spark-Iceberg built in CDC mechanism to PoC a replacement for wikimedia_wikitext_current to Use the Spark-Iceberg built in CDC mechanism to PoC a replacement for wmf.wikimedia_wikitext_current.Mar 19 2025, 6:13 PM
xcollazo updated the task description. (Show Details)

+1 for wmf_content.mediawiki_content_current_v1

@JAllemandou how might this look if we did the snowflake 'metadata' page table ideas?

  • mediawiki_page_history - 'metadata'
  • mediawiki_page_current - just current page state 'metadata'
  • mediawiki_page_revision_content - immutable page content?
  • mediawiki_page_html_content, mediawiki_wikibase_page_entity_content, mediawiki_page_prediction_revert_risk, etc.?

I'm only asking in case we don't want to duplicate the revision content in another table? Would having mediawiki_page_current be enough to link back to the existing mediawiki_content_history_v1 by current revision_id?

(BTW, maybe the more correct term for this is 'star' rather than 'snowflake'? @JAllemandou ?)

xcollazo changed the task status from Open to In Progress.Mar 20 2025, 3:15 PM
xcollazo claimed this task.
xcollazo triaged this task as High priority.
xcollazo renamed this task from Use the Spark-Iceberg built in CDC mechanism to PoC a replacement for wmf.wikimedia_wikitext_current to Spike: Figure how best way to produce wmf_content.mediawiki_content_current_v1.Apr 7 2025, 6:19 PM
xcollazo updated the task description. (Show Details)
xcollazo renamed this task from Spike: Figure how best way to produce wmf_content.mediawiki_content_current_v1 to Spike: Figure how best to produce wmf_content.mediawiki_content_current_v1.Apr 7 2025, 6:50 PM

Good work. I wonder if flipping the problem around and writing the latest PK to some faster lookup table would be useful; then you could skip all that deduplicating window stuff. So there'd be a second MERGE INTO that just overwrites the latest primary key, which I guess would have to be timestamp, revision_id, page_id, wiki_id.

Good work. I wonder if flipping the problem around and writing the latest PK to some faster lookup table would be useful; then you could skip all that deduplicating window stuff. So there'd be a second MERGE INTO that just overwrites the latest primary key, which I guess would have to be timestamp, revision_id, page_id, wiki_id.

Interesting, that could def solve the performance difference indeed, but then I would have two sets of logic (the procedure plus the SQL to dedup via the lookup), and I would also need to keep state (because I would still need to track which incremental changes I've consumed from the procedure). All reasonable if we were pursuing hourly updates.
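
For the record, a hedged sketch of the lookup-table idea suggested above. The table, its name, and its columns are hypothetical; this is only meant to show the shape of the "second merge that overwrites the latest primary key":

```sql
-- Sketch only: a small lookup table that tracks the latest revision pointer
-- per page, so downstream consumers can skip window-based deduplication.
CREATE TABLE IF NOT EXISTS wmf_content.mediawiki_latest_revision_lookup (
  wiki_id     STRING,
  page_id     BIGINT,
  revision_id BIGINT,
  revision_dt TIMESTAMP
) USING iceberg;

-- On each ingestion batch, upsert the latest pointer per page.
-- 'new_revisions' is a hypothetical batch of newly ingested revisions.
MERGE INTO wmf_content.mediawiki_latest_revision_lookup AS lookup
USING new_revisions AS r
ON  lookup.wiki_id = r.wiki_id
AND lookup.page_id = r.page_id
WHEN MATCHED AND r.revision_dt > lookup.revision_dt THEN
  UPDATE SET revision_id = r.revision_id, revision_dt = r.revision_dt
WHEN NOT MATCHED THEN
  INSERT (wiki_id, page_id, revision_id, revision_dt)
  VALUES (r.wiki_id, r.page_id, r.revision_id, r.revision_dt);
```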

Let's keep all these ideas, though, for when we pursue v2 of mediawiki content / enrichment.