Data quality checks show that a significant number of revisions on wmf_dumps.wikitext_raw_rc2 are missing or have mismatched metadata.
The code that generated these figures is at: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/notebooks/Can_we_emit_wiki_db__revision_id_pairs_for_reconciliation.ipynb?ref_type=heads
Status
Last updated: 2024-10
We adopted a reconciliation strategy built around:
- a batch job that runs periodically against the analytics MySQL replicas to detect missed/mismatched revisions.
- the job will generate "reconciliation events" for the missing/mismatched revisions.
- enrichment of these events (with wikitext and redirect info) is performed by a streaming enrichment application that piggybacks on page_content_change logic (and mediawiki-event-enrichment infrastructure).
- a batch job will join existing and reconciled data and write it into the content history table.
A design doc for this approach, as well as the alternatives we considered, currently lives in a Google Doc. We'll move it to Wikitech once the current Dumps 2.0
implementation phase is concluded and the docs are finalized.
The general idea for the reconciliation mechanism is as follows:
Reconciliation mechanism
The Dumps 2.0 project strives to have a correct and complete copy of all the revisions from all wikis over all of wiki time.
For recent data, we leverage data produced by Event Platform, namely tables event.mediawiki_page_content_change_v1 and event.mediawiki_revision_visibility_change. For historical data, we currently backfill from wmf.mediawiki_wikitext_history, which itself is populated via importing Dumps 1.0 output.
None of these systems guarantees complete or correct data. Event Platform, or the upstream Event Bus, may fail and miss events. Similarly, Dumps 1.0 frequently skips rows due to runtime errors. Additionally, we want to avoid depending on Dumps 1.0 infrastructure. We therefore need a reconciliation mechanism that should:
- Replace the current backfill mechanism, thus decoupling us from Dumps 1.0.
- Provide a way to sense whether revisions are missing or incomplete.
- Provide a way to fetch revisions that are missing or incomplete.
Provide a way to sense whether revisions are missing or incomplete.
As part of our work to create data drift metrics (T354761), we also built a proof-of-concept mechanism to connect the Analytics Replicas to Apache Spark. We propose to use a similar mechanism to find inconsistencies between the source of truth and our data lake table, in two phases: Recent Data and Historical Data. Both phases would use the same mechanism:
For each wiki do:
- Fetch the revisions from wmf.wikitext_raw as of a time window (WHERE revision_dt >= dt1 AND revision_dt <= dt2)
- Fetch the revisions from the Analytics Replica as of a time window (WHERE revision_dt >= dt1 AND revision_dt <= dt2)
- Compare both, and determine whether we are missing records. If so, keep them on a list.
- Compare both, and determine whether records disagree in terms of metadata and/or content. If so, keep them on a list.
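The comparison step above can be sketched in plain Python. This is only an illustration of the logic: in practice both sides would be Spark DataFrames, and the metadata fields compared (here just a hypothetical sha1) would be whatever columns the reconciliation job selects.

```python
def reconcile(lake_rows, replica_rows):
    """Compare revisions in the data lake against the source-of-truth
    replica for one wiki and one time window.

    Both inputs map revision_id -> metadata dict.
    Returns (missing, mismatched) lists of revision_ids.
    """
    # Revisions present in the replica but absent from the lake.
    missing = sorted(set(replica_rows) - set(lake_rows))
    # Revisions present in both, but whose metadata disagrees.
    mismatched = sorted(
        rev_id
        for rev_id, meta in replica_rows.items()
        if rev_id in lake_rows and lake_rows[rev_id] != meta
    )
    return missing, mismatched

# Example: revision 3 never reached the lake, revision 2 disagrees on sha1.
replica = {1: {"sha1": "a"}, 2: {"sha1": "b"}, 3: {"sha1": "c"}}
lake = {1: {"sha1": "a"}, 2: {"sha1": "x"}}
missing, mismatched = reconcile(lake, replica)
# missing == [3], mismatched == [2]
```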
For the 'Recent data' run, we would run the reconciliation mechanism only over a window covering the last day, minus a gap to account for the latency of our ingestion, which is typically ~1 hour. We will run this daily.
For the 'Historical data' run, we would run the reconciliation mechanism over all of a particular wiki's history, likely in windowed chunks so as not to overwhelm the replicas. We will run this monthly.
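The two windowing schemes above could look like the following sketch. The 1-hour gap and the 30-day chunk size are assumptions for illustration, not settled values:

```python
from datetime import datetime, timedelta

INGESTION_GAP = timedelta(hours=1)  # assumed typical ingestion latency

def recent_window(now):
    """Daily 'Recent data' window: the last day, ending one gap before now."""
    end = now - INGESTION_GAP
    return end - timedelta(days=1), end

def historical_windows(wiki_start, now, chunk=timedelta(days=30)):
    """'Historical data' windows: chunked intervals covering all of a wiki's
    history, so a single query never scans too much of the replica at once."""
    windows, start = [], wiki_start
    while start < now:
        end = min(start + chunk, now)
        windows.append((start, end))
        start = end
    return windows
```

Each (dt1, dt2) pair then parameterizes the `WHERE revision_dt >= dt1 AND revision_dt <= dt2` queries described above.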
At the end of this process we will have a list of offending (wiki_db, revision_id) pairs. These pairs will be kept in a data lake table, for now named wmf_dumps.wikitext_missing_or_inaccurate_rows, for a reasonable amount of time, say 90 days.
This table will be used for the next steps.
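The 90-day retention could be enforced with a periodic pruning statement along these lines. The `detected_dt` column name is an assumption for illustration:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # assumed retention period

def retention_delete_sql(table, now):
    """Build the SQL that prunes offending pairs older than the retention
    window. Column name (detected_dt) is illustrative."""
    cutoff = (now - RETENTION).strftime("%Y-%m-%d")
    return f"DELETE FROM {table} WHERE detected_dt < '{cutoff}'"
```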
Provide a way to fetch revisions that are missing or incomplete.
Now that we have an offending (wiki_db, revision_id) list, we need to somehow fetch this data. Sensing data at the wmf.wikitext_raw level makes sense, considering that we do ingest all events from event.mediawiki_page_content_change_v1 and event.mediawiki_revision_visibility_change.
However, other systems can leverage the fact that we have detected a set of revisions that have not made it to our eventing infrastructure. Considering this, we propose not to have a bespoke mechanism that would fetch the offending revisions for only our consumption. Rather, we propose that an upstream system, Event Bus, should have an API in which we can do the following:
- Given a list of (wiki_db, revision_id) pairs, the API should accept them, fetch the latest state of each pair, and produce it to a new 'reconciliation' Kafka topic. This topic will be consumed by a Flink job very similar to (if not the same as) the page_content_change job, and it will produce similar output. Since this new stream will be part of Event Platform, a Gobblin process will materialize these late events into the data lake under the event schema.
This way, any interested system will also be able to consume 'reconciliation' events.
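As a sketch of what a request to such an API might carry, the snippet below builds a minimal reconciliation event per offending pair. The schema URI, stream name, and field layout are assumptions loosely modeled on Event Platform conventions; the real schema would be defined alongside the Event Bus API:

```python
import json
from datetime import datetime, timezone

def make_reconciliation_event(wiki_db, revision_id):
    """Build a minimal reconciliation request for one offending
    (wiki_db, revision_id) pair. All field names are illustrative."""
    return {
        "$schema": "/development/reconciliation/1.0.0",  # hypothetical schema URI
        "meta": {
            "domain": wiki_db,
            "dt": datetime.now(timezone.utc).isoformat(),
            "stream": "mediawiki.reconciliation",  # hypothetical stream name
        },
        "wiki_db": wiki_db,
        "revision_id": revision_id,
    }

# Offending pairs would be sent in batches, one event per pair:
batch = [make_reconciliation_event("enwiki", 123456)]
payload = json.dumps(batch)
```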
In the case of Dumps 2.0, consuming these new events will require putting together a new MERGE INTO job, very similar to the existing events_merge_into.py. In fact, simply changing the source table in this pipeline should hopefully suffice, as the schema should be the same.
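In the spirit of events_merge_into.py, the new job would issue a Spark SQL MERGE INTO with the reconciliation stream's table as the source. This sketch only assembles the statement; table names, the join key, and the UPDATE/INSERT clauses are illustrative:

```python
def reconciliation_merge_sql(source_table, target_table):
    """Build a MERGE INTO statement that upserts reconciled revisions
    into the content table. Names and clauses are illustrative."""
    return (
        f"MERGE INTO {target_table} t\n"
        f"USING {source_table} s\n"
        f"ON t.wiki_db = s.wiki_db AND t.revision_id = s.revision_id\n"
        f"WHEN MATCHED THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )
```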
Mine the data available in wmf_dumps.wikitext_missing_or_inaccurate_rows
We want to keep wmf_dumps.wikitext_missing_or_inaccurate_rows around for a while, as it seems useful beyond emitting reconciliation events. It is, effectively, a good source of data quality metrics, and also helps in understanding data issues, current and future.
From the point of view of data quality metrics, we should use this data to produce daily metrics on the quality of both recent and historical data in wmf_dumps.wikitext_raw. This will supersede the metrics developed in T354761: Implement first set of data quality checks, since this is a cheaper way to get similar quality/drift information.
From the point of view of data issues, in the event that we see a lot of revisions having issues (say, more than 5%), we'd want to alert on it and perhaps even stop the emission of reconciliation events so as not to overwhelm the system.
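The circuit-breaker check above is simple enough to sketch directly. The 5% threshold comes from the text; treating an empty window as healthy is an assumption:

```python
ALERT_THRESHOLD = 0.05  # alert when more than 5% of revisions have issues

def should_pause_reconciliation(offending_count, total_revisions):
    """Decide whether to alert and pause the emission of reconciliation
    events, as a guard against overwhelming downstream systems."""
    if total_revisions == 0:
        return False  # assumption: no data in the window means nothing to flag
    return offending_count / total_revisions > ALERT_THRESHOLD
```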

