The following is a list of changes we want to try out and test on the MediaWiki history reconstruction pipeline. If you're going to grab an item, cross it off and add your name next to it. There are too many of these to make a separate task for each one, and some might not turn out to be useful or real issues.
- (@JAllemandou or @fdans?) In processDeleteEvents in the page history builder, if a PageId exists, don't assign a fakeId
  - https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/485710
- Discard events if timestamp is 0: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/page/PageEventBuilder.scala#L81
  - https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/491494
- Use page_id when it exists in the subgraph, to maybe partition better (for example, in case we miss events in the middle of a page's history): https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/page/PageEventBuilder.scala#L118
  - Complex change, creating its own task: T218130
- Use the page id extracted from log_page on move events, that we currently ignore:
- Discard revisions with rev_page = 0. To find examples: mysql:research@analytics-store.eqiad.wmnet [etwiki]> select * from revision where rev_page = 0; Code: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/page/PageHistoryRunner.scala#L217
  - https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/491494
- Should fail if page_id is none, since that does not make sense given the database constraints: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/page/PageHistoryRunner.scala#L236
  - https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/491494
- Should fail if rev2.rev_user is none, since that does not make sense given the db constraints: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/page/PageHistoryRunner.scala#L249 (NOTE: this can be simplified once the rev_actor migration is complete)
  - Invalid: the user can be undefined for the first rev. Now we have user_text, but it can also be undefined if revision_deleted&4.
- Refactor Page Events to include "create" events from the logging table, and create artificial "create" events out of the current logic that gets the first revision of a page. As part of that, consider the trick that Joseph found: look into the archive table for the first revision when it happens to be archived for a restored page.
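
For anyone picking up the discard/fail items above, here is a rough sketch of the intended behaviour. It is written in Python for brevity and is not the actual Scala code in refinery-source. The `PageEvent` and `Revision` classes and both function names are hypothetical, and it assumes a missing or unparseable timestamp is represented as 0.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class PageEvent:
    """Hypothetical, simplified page event (not the pipeline's real class)."""
    page_id: Optional[int]
    timestamp: int  # assumption: 0 stands for a missing or unparseable timestamp


@dataclass
class Revision:
    """Hypothetical, simplified row of the revision table."""
    rev_id: int
    rev_page: Optional[int]


def discard_zero_timestamp(events: Iterable[PageEvent]) -> List[PageEvent]:
    """Drop events whose timestamp is 0."""
    return [e for e in events if e.timestamp != 0]


def clean_revisions(revisions: Iterable[Revision]) -> List[Revision]:
    """Fail on a missing page id, then drop revisions with rev_page = 0."""
    kept = []
    for rev in revisions:
        if rev.rev_page is None:
            # Should not happen given the database constraints, so fail loudly.
            raise ValueError(f"revision {rev.rev_id} has no page id")
        if rev.rev_page == 0:
            continue  # discard, as proposed in the list above
        kept.append(rev)
    return kept
```

Failing on a missing page id while silently dropping `rev_page = 0` mirrors the distinction made in the list: the first case should be impossible, the second is known to occur in real data.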