Data quality checks show that we are missing, or have mismatched metadata for, a significant number of revisions on `wmf_dumps.wikitext_raw_rc2`. For example, a run on `simplewiki` [[ https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/24#note_70045 | done here ]] shows the following:
{F55542315} {F55542336}
Code that generated these figures is at: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/notebooks/Can_we_emit_wiki_db__revision_id_pairs_for_reconciliation.ipynb?ref_type=heads

```
do_quality_check_on_revision_history:
+----------------------+
|revision_count_summary|
+----------------------+
|  0.9987998799879988  |
+----------------------+
```

These revisions are presumably not available in Dumps 1.0, as `wmf_dumps.wikitext_raw_rc2` is backfilled from Dumps 1.0 output. In this task we need to:

[] Study the missing revisions: check if they are missing from Dumps 1.0 as well, if we are indeed missing events from Event Platform, or if there is some other issue.
[] If the study shows that we can recover these revisions, then implement a PySpark job to fetch and incorporate them as needed, perhaps via the MW API? [If MW API, see if we can leverage the existing code from [[ https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/blob/main/mediawiki_event_enrichment/page_content_change.py?ref_type=heads | page_content_change.py ]]?]
The general idea for the reconciliation mechanism is as follows:
# Reconciliation mechanism
The Dumps 2.0 project strives to have a correct and complete copy of all the revisions from all wikis over all of wiki time.
For recent data, we leverage data produced by Event Platform, namely tables `event.mediawiki_page_content_change_v1` and `event.mediawiki_revision_visibility_change`. For historical data, we currently backfill from `wmf.mediawiki_wikitext_history`, which itself is populated via importing Dumps 1.0 output.
None of these systems guarantees complete or correct data. Event Platform, or the upstream Event Bus, may fail and miss events. Similarly, Dumps 1.0 frequently skips rows due to runtime errors. Additionally, we want to avoid the dependency on Dumps 1.0 infrastructure. For this, we need a reconciliation mechanism that should:
- Replace the current backfill mechanism, thus decoupling us from Dumps 1.0.
- Provide a way to sense whether revisions are missing or incomplete.
- Provide a way to fetch revisions that are missing or incomplete.
## Provide a way to sense whether revisions are missing or incomplete.
As part of our work to create data drift metrics (T354761), we also made a [[ https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/5c9f50c717327fb59be4426ed20f357343cc7456/notebooks/Access%20MariaDB%20From%20Cluster.ipynb | proof of concept mechanism ]] to connect the Analytics Replicas to Apache Spark. We propose to use a similar mechanism to find inconsistencies between the source of truth and our data lake table in two phases: **Recent Data** and **Historical Data**. Both phases would utilize the same mechanism:
For each wiki do:
- Fetch the revisions from `wmf.wikitext_raw` as of a time window (WHERE revision_dt >= dt1 AND revision_dt <= dt2)
- Fetch the revisions from the Analytics Replica as of the same time window (WHERE revision_dt >= dt1 AND revision_dt <= dt2)
- Compare both, and figure out whether we are missing records. If so, keep them on a list.
- Compare both, and figure out whether records disagree in terms of metadata and/or content. If so, keep them on a list.
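The per-wiki comparison step above can be sketched in plain Python (the metadata field names `rev_sha1` and `rev_len` are illustrative assumptions; the real job would do this in Spark over the two fetched datasets):

```python
def reconcile(source_rows, lake_rows):
    """Compare source-of-truth rows against datalake rows.

    Both inputs map (wiki_db, revision_id) -> a metadata dict,
    e.g. {"rev_sha1": ..., "rev_len": ...} (hypothetical fields).
    Returns (missing, mismatched): lists of (wiki_db, revision_id) pairs.
    """
    # Present in the replica but absent from the datalake table.
    missing = [key for key in source_rows if key not in lake_rows]
    # Present on both sides, but metadata and/or content disagrees.
    mismatched = [
        key for key, meta in source_rows.items()
        if key in lake_rows and lake_rows[key] != meta
    ]
    return missing, mismatched


source = {
    ("simplewiki", 1): {"rev_sha1": "aaa", "rev_len": 10},
    ("simplewiki", 2): {"rev_sha1": "bbb", "rev_len": 20},
    ("simplewiki", 3): {"rev_sha1": "ccc", "rev_len": 30},
}
lake = {
    ("simplewiki", 1): {"rev_sha1": "aaa", "rev_len": 10},  # matches
    ("simplewiki", 2): {"rev_sha1": "XXX", "rev_len": 20},  # metadata disagrees
    # revision 3 is missing entirely
}
missing, mismatched = reconcile(source, lake)
print(missing)     # [('simplewiki', 3)]
print(mismatched)  # [('simplewiki', 2)]
```

Both output lists feed the offending-pairs table described below.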
For the run for '**Recent data**', we would run the reconciliation mechanism only for a window covering the last day, minus a gap to account for the latency of our ingestion, which is typically ~1 hour. We will run this daily.

For the run for '**Historical data**', we would run the reconciliation mechanism over all of wiki time for a particular wiki, likely in windowed chunks so as to not overwhelm the replicas. We will run this monthly.
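The windowed chunking for the historical run could be sketched as follows (a hypothetical helper; the 30-day window size is an assumption, not a project decision):

```python
from datetime import datetime, timedelta


def historical_windows(start, end, days=30):
    """Split [start, end) into fixed-size (dt1, dt2) windows so that a
    single replica query never scans all of wiki time at once."""
    windows = []
    dt1 = start
    while dt1 < end:
        dt2 = min(dt1 + timedelta(days=days), end)
        windows.append((dt1, dt2))
        dt1 = dt2
    return windows


# Each (dt1, dt2) pair feeds the revision_dt time-window filter shown above.
ws = historical_windows(datetime(2001, 1, 15), datetime(2001, 4, 1))
```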
At the end of this process we will have a list of offending `(wiki_db, revision_id)` pairs. These pairs will be kept in a datalake table, for now named `wmf_dumps.wikitext_missing_or_inaccurate_rows`, for a reasonable amount of time, say 90 days.
This table will be used for the next steps.
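The 90-day retention could be applied with a simple prune pass, sketched here in plain Python (the `detected_dt` column name is an assumption about the table's schema):

```python
from datetime import datetime, timedelta


def prune_offending_rows(rows, now, retention_days=90):
    """Keep only offending rows detected within the retention window
    (90 days per the proposal above)."""
    cutoff = now - timedelta(days=retention_days)
    return [row for row in rows if row["detected_dt"] >= cutoff]


rows = [
    {"wiki_db": "simplewiki", "revision_id": 2, "detected_dt": datetime(2024, 1, 1)},
    {"wiki_db": "simplewiki", "revision_id": 3, "detected_dt": datetime(2024, 5, 1)},
]
kept = prune_offending_rows(rows, now=datetime(2024, 5, 10))
# Only the row detected within the last 90 days survives.
```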
## Provide a way to fetch revisions that are missing or incomplete.
Now that we have an offending `(wiki_db, revision_id)` list, we need to somehow fetch this data. Sensing data at the `wmf.wikitext_raw` level makes sense, considering that we do ingest all events from `event.mediawiki_page_content_change_v1` and `event.mediawiki_revision_visibility_change`.
However, other systems can leverage the fact that we have detected a set of revisions that have not made it to our eventing infrastructure. Considering this, we propose to not have a bespoke mechanism that would fetch the offending revisions for only our consumption. Rather, we propose that an upstream system, Event Bus, should have an API in which we can do the following:
- Given a list of `(wiki_db, revision_id)` pairs, this API should accept them, then fetch and produce the latest state of each pair to a new 'reconciliation' Kafka topic. This Kafka topic will be consumed by a Flink job, very similar to (if not the same as) the page_content_change job, and it will produce similar output. Since this new stream will be part of Event Platform, a Gobblin process will materialize these late events into the datalake under the `event` schema.

This way, any interested system will also be able to consume 'reconciliation' events.

In the case of Dumps 2.0, consuming these new events will require a new MERGE INTO job to be put together, very similar to the existing [[ https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/mediawiki_content_dump/events_merge_into.py | events_merge_into.py ]]. In fact, hopefully just changing the source table in this pipeline should suffice, as the schema should be the same.
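A rough sketch of the behavior we are asking of the proposed Event Bus API (the topic name, event fields, and the `fetch_latest_state` helper are all hypothetical):

```python
def emit_reconciliation_events(pairs, fetch_latest_state, produce):
    """For each offending (wiki_db, revision_id) pair, fetch the latest
    state of the revision and produce it to a 'reconciliation' topic.
    Returns the number of events produced."""
    produced = 0
    for wiki_db, revision_id in pairs:
        state = fetch_latest_state(wiki_db, revision_id)
        if state is None:
            continue  # revision no longer exists; nothing to replay
        produce("mediawiki.reconciliation", {
            "wiki_db": wiki_db,
            "revision_id": revision_id,
            "state": state,
        })
        produced += 1
    return produced


# Toy usage: an in-memory "topic" and a stub lookup that only finds rev 2.
topic = []
n = emit_reconciliation_events(
    [("simplewiki", 2), ("simplewiki", 3)],
    fetch_latest_state=lambda wiki, rev: {"sha1": "bbb"} if rev == 2 else None,
    produce=lambda name, event: topic.append((name, event)),
)
```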
## Mine the data available in `wmf_dumps.wikitext_missing_or_inaccurate_rows`
We want to keep `wmf_dumps.wikitext_missing_or_inaccurate_rows` around for a while, as it seems useful for more than just emitting reconciliation events. It is, effectively, a good source for data quality metrics, and also for understanding data issues, current and future.
From the point of view of data quality metrics, we should use this data to produce daily metrics on the 'recent' quality of the data in `wmf_dumps.wikitext_raw`, as well as its historical quality. This will supersede the metrics developed in {T354761}, considering this is a cheaper way to get similar quality/drift information.
From the point of view of data issues, in the event that we see a large share of revisions having issues (say, more than 5%), we would want to alert on it, and perhaps even stop the emission of reconciliation events so as to not overwhelm the system.
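The alerting rule could be sketched as follows (only the 5% threshold comes from the text above; the function shape and field names are assumptions):

```python
def reconciliation_health(total_revisions, offending_revisions, threshold=0.05):
    """Decide whether to alert, and whether to pause the emission of
    reconciliation events, based on the share of offending revisions."""
    if total_revisions == 0:
        return {"offending_ratio": 0.0, "alert": False, "pause_emission": False}
    ratio = offending_revisions / total_revisions
    over = ratio > threshold
    return {"offending_ratio": ratio, "alert": over, "pause_emission": over}


# Roughly in line with the simplewiki figure above (~0.12% offending): no alert.
status = reconciliation_health(total_revisions=1_000_000, offending_revisions=1_200)
```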