
[Spike] Identify and mitigate risks associated with MediaWiki History pipeline
Closed, Resolved (Public)

Description

Goal

The MediaWiki History pipeline is complex. We need to know: what are our immediate risks, and how can we mitigate them?

MediaWiki History Reduced Checker alarm 2023-07

In the 2023-07 run, the automated job that compares the current and previous snapshots raised an alarm because a few categories had fewer results than expected.
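For readers unfamiliar with this kind of check, here is a minimal sketch of a snapshot-to-snapshot comparison. It is not the actual checker code: the function name `find_shrunk_categories`, the category names, and the 1% threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def find_shrunk_categories(
    previous: Dict[str, int],
    current: Dict[str, int],
    max_drop_ratio: float = 0.01,  # hypothetical threshold, not the real one
) -> List[str]:
    """Return categories whose count dropped by more than max_drop_ratio
    between the previous snapshot and the current one."""
    alarms = []
    for category, prev_count in sorted(previous.items()):
        # A category missing from the current snapshot counts as zero.
        curr_count = current.get(category, 0)
        if prev_count > 0 and (prev_count - curr_count) / prev_count > max_drop_ratio:
            alarms.append(category)
    return alarms


# Hypothetical per-category result counts for two consecutive snapshots.
previous_snapshot = {"revision_create": 1000, "page_create": 100, "user_create": 50}
current_snapshot = {"revision_create": 980, "page_create": 100, "user_create": 50}

print(find_shrunk_categories(previous_snapshot, current_snapshot))
```

A check like this only says that a count shrank, not why, which is the reason the root cause still needs a separate investigation.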

We currently suspect this is due to a change in the source data, namely the correct redaction of the rev_actor field in the cloud replica views that we sqoop from. Because rev_actor is now redacted throughout history wherever a rev_deleted flag indicates that we should hide the user, the input data has fewer revisions and the output should therefore have fewer results. A quick look at the MW History algorithm did not confirm this, so we need a deeper look. This task is the placeholder for that investigation, as well as the place where we should update the status of the data quality findings.
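To make the suspected mechanism concrete, here is a small sketch of how redacting rev_actor could reduce the number of revisions that can be attributed to a user. It relies on MediaWiki's rev_deleted bitfield, where bit value 4 marks the user as hidden. The rest is an assumption for illustration only: the helper names, the sample rows, and the choice of `None` as the redacted value are hypothetical and do not reproduce the actual view definition or the MW History code.

```python
from typing import Any, Dict, List

# MediaWiki rev_deleted bit that marks the revision's user as hidden.
DELETED_USER = 4


def redact_actor(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with rev_actor hidden when the user is suppressed.
    Using None as the redacted value is an assumption for this sketch."""
    if row["rev_deleted"] & DELETED_USER:
        return {**row, "rev_actor": None}
    return row


def count_attributed(rows: List[Dict[str, Any]]) -> int:
    """Count the revisions that still carry an actor."""
    return sum(1 for row in rows if row["rev_actor"] is not None)


# Hypothetical revision rows; rev_deleted is a bitfield.
rows = [
    {"rev_id": 1, "rev_actor": 10, "rev_deleted": 0},
    {"rev_id": 2, "rev_actor": 10, "rev_deleted": 4},  # user hidden
    {"rev_id": 3, "rev_actor": 11, "rev_deleted": 1},  # only text hidden
    {"rev_id": 4, "rev_actor": 11, "rev_deleted": 6},  # comment and user hidden
]

redacted = [redact_actor(row) for row in rows]
print(count_attributed(rows), count_attributed(redacted))
```

Whether downstream steps drop such rows or keep them with an empty actor is exactly the kind of detail the deeper look at the algorithm would need to confirm.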

Details

Other Assignee
mforns

Event Timeline

VirginiaPoundstone renamed this task from MediaWiki History Reduced Checker alarm 2023-07 to [Spike] Identify and mitigate risks associated with MediaWiki History pipeline. Aug 30 2023, 2:50 PM
VirginiaPoundstone updated the task description.
Milimetric updated Other Assignee, added: mforns.

I think the short term risk is that the data is not correct.
After the changes to rev_deleted/rev_actor were applied, I did a brief data vetting and couldn't find any anomalous data behavior.
However, the MediaWikiHistory code is complex and so are the job's data flows, so it is possible that some details have escaped me.

I think there's a long term risk too.
The MediaWikiHistory reconstruction algorithm is highly coupled with the MediaWiki databases.
Whenever there's a significant change in the MediaWiki databases, chances are that our code will break.
Given that the code is also quite complex, we could end up spending a lot of maintenance time on this project in the future.

cjming moved this task from Paused to Sprint Backlog on the Data Products (Sprint 01) board.