Reduce number of missing revisions from mediawiki-history
Open, Needs Triage, Public

Description

We currently have revisions missing from mediawiki_history compared to events. The analysis at https://phabricator.wikimedia.org/T215001#7465848 shows around 1k missing revisions per day for the days just before the snapshot is taken. The cause is the order in which we import the MediaWiki tables: we currently import archive, then a bunch of other tables, then revision. For big projects, this means some hours pass between importing archive and importing revision. Some pages are deleted in between the two imports, so their revisions end up in neither the imported archive nor the imported revision table.
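The race described above can be reproduced with a toy model. This is only an illustrative sketch: the function name `snapshot_with_delete_in_between` and the in-memory sets are hypothetical stand-ins for the real import job and tables, and it assumes that a page delete moves the page's revisions from revision to archive.

```python
def snapshot_with_delete_in_between(order):
    """Import two tables in `order`, with a page delete between the imports."""
    # Live database state: revision 1 belongs to a page that is not deleted yet.
    live = {"revision": {1}, "archive": set()}
    imported = {}
    for step, table in enumerate(order):
        if step == 1:
            # The page is deleted after the first import and before the second:
            # its revisions move from `revision` to `archive`.
            live["archive"] |= live["revision"]
            live["revision"] = set()
        # Snapshot the table as it looks at import time.
        imported[table] = set(live[table])
    return imported


# Current order: revision 1 is in neither imported table.
current_order = snapshot_with_delete_in_between(["archive", "revision"])

# Reversed order: revision 1 is in both imported tables instead of being lost.
reversed_order = snapshot_with_delete_in_between(["revision", "archive"])
```

With the current order the revision disappears from the snapshot; with the reversed order it shows up twice, which is a duplicate to resolve rather than a loss.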
To fix this, I suggest:

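  • Force the order of table imports to be revision, then archive, then page, then logging. With this order, a delete between the two imports leaves the revision in both tables instead of in neither. A restore between the two imports can still lose revisions, but the number of restores is a lot smaller than the number of deletes.
  • Make the mediawiki_history job check delete/restore times when a revision occurs in both the archive and revision tables, in order to choose which one to keep.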
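The second suggestion could be sketched roughly as follows. This is not the mediawiki_history implementation: the function `choose_copy` and its inputs are hypothetical, and the rule it applies (keep the archive copy when the page's latest delete is more recent than its latest restore) is an assumption about how the delete/restore times would be compared.

```python
from datetime import datetime
from typing import Optional


def choose_copy(last_delete: Optional[datetime],
                last_restore: Optional[datetime]) -> str:
    """Pick which copy to keep for a revision found in both tables.

    Assumed rule: the most recent delete/restore event of the page decides
    whether the revision is currently deleted (archive) or live (revision).
    """
    if last_delete is None:
        # No delete recorded for the page: treat the revision as live.
        return "revision"
    if last_restore is None or last_delete > last_restore:
        # The latest event is a delete: the archive copy reflects the end state.
        return "archive"
    # The latest event is a restore: the revision copy reflects the end state.
    return "revision"


deleted = choose_copy(datetime(2021, 11, 1, 3, 0), None)
restored = choose_copy(datetime(2021, 11, 1, 3, 0), datetime(2021, 11, 1, 4, 0))
```

Here `deleted` resolves to the archive copy and `restored` to the revision copy. In practice the timestamps would presumably come from the imported logging table, which is one reason to import it last.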