We have found records in the archive table that have the same rev_id and rev_timestamp as existing records in the revision table. TODO: check whether there is an efficient way to filter these out in the sqoop, and file a bug with mediawiki-core if there are recent examples.
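To make the check concrete, here is a minimal sketch of the duplicate detection and the corresponding filter. It runs against an in-memory SQLite database with toy data, not against the production replicas. It assumes the archive columns are named ar_rev_id and ar_timestamp; the function names and sample rows are hypothetical. Whether a predicate like this is cheap enough to run as part of the sqoop import is still the open question.

```python
import sqlite3


def build_demo_db():
    """Create a toy in-memory database with simplified revision/archive tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE revision (rev_id INTEGER, rev_timestamp TEXT);
        CREATE TABLE archive (ar_rev_id INTEGER, ar_timestamp TEXT);

        INSERT INTO revision VALUES (1, '20200101000000'), (2, '20200102000000');

        -- (2, ...) matches a revision row on both id and timestamp: a duplicate.
        -- (1, ...) shares only the id, so it is not treated as a duplicate here.
        INSERT INTO archive VALUES
            (2, '20200102000000'),
            (3, '20200103000000'),
            (1, '20190101000000');
        """
    )
    return conn


def find_archive_duplicates(conn):
    """Return archive rows whose id and timestamp both match a revision row."""
    query = """
        SELECT a.ar_rev_id, a.ar_timestamp
        FROM archive AS a
        JOIN revision AS r
          ON r.rev_id = a.ar_rev_id
         AND r.rev_timestamp = a.ar_timestamp
        ORDER BY a.ar_rev_id
    """
    return conn.execute(query).fetchall()


def select_archive_without_duplicates(conn):
    """Return archive rows that have no matching revision row (anti-join)."""
    query = """
        SELECT a.ar_rev_id, a.ar_timestamp
        FROM archive AS a
        WHERE NOT EXISTS (
            SELECT 1
            FROM revision AS r
            WHERE r.rev_id = a.ar_rev_id
              AND r.rev_timestamp = a.ar_timestamp
        )
        ORDER BY a.ar_rev_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print("duplicates:", find_archive_duplicates(conn))
    print("kept:", select_archive_without_duplicates(conn))
```

The first query is the one to run when looking for recent examples to attach to a mediawiki-core bug. The second shows the shape of a filter that could be applied at import time or in a later processing step.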
This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!
For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)
This task took a wild ride through our board. I'm not sure what happened or why I deprioritized it, but it seems worth looking into to ensure the quality of the mw history dataset.