The MediaWiki History pipeline is complex. We need to know: what are our immediate risks, and how can we mitigate them?
MediaWiki History Reduced Checker alarm 2023-07
In the 2023-07 run, the automated job that compares the current snapshot with the previous one raised an alarm because a few categories had fewer results than expected.
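To make the kind of check described above concrete, here is a minimal Python sketch of a snapshot-to-snapshot comparison. It is not the actual checker: the function name, the category names, and the 1% threshold are all hypothetical, chosen only to illustrate how a drop in per-category counts could trigger an alarm.

```python
from typing import Dict, List


def find_shrinking_categories(
    previous: Dict[str, int],
    current: Dict[str, int],
    max_drop_ratio: float = 0.01,  # hypothetical threshold, not the real checker's value
) -> List[str]:
    """Return categories whose count dropped by more than max_drop_ratio."""
    flagged = []
    for category, prev_count in sorted(previous.items()):
        if prev_count == 0:
            # Nothing to compare against; skip to avoid dividing by zero.
            continue
        curr_count = current.get(category, 0)
        drop = (prev_count - curr_count) / prev_count
        if drop > max_drop_ratio:
            flagged.append(category)
    return flagged


# Hypothetical per-category counts for two consecutive snapshots.
previous_snapshot = {"edits": 1000, "new_pages": 200, "deleted": 50}
current_snapshot = {"edits": 1010, "new_pages": 190, "deleted": 50}

flagged = find_shrinking_categories(previous_snapshot, current_snapshot)
print(flagged)
```

Under these assumed numbers only new_pages shrinks by more than the threshold, so it is the only category flagged. Growth and unchanged counts pass silently.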
We currently suspect this is due to a change in the source data, namely the correct redaction of the rev_actor field in the cloud replica views that we sqoop from. With rev_actor redacted throughout history wherever a rev_deleted flag indicates that the user should be hidden, the input data has fewer revisions, and therefore the output should have fewer results. A quick look at the MW History algorithm did not confirm this, so we need a deeper look. This task is the placeholder for that investigation, as well as the place where we should post status updates on the data quality findings.