de-duplicate archive records matching revision records in mediawiki_history
Open, Low, Public

Description

We have found records in the archive table that have the same rev_id and rev_timestamp as existing records in the revision table. TODO: check whether there is an efficient way to filter these out during the sqoop import, and file a bug with mediawiki-core if there are recent examples.
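
One possible shape for the filter is an anti-join that keeps only archive rows with no matching revision row. The sketch below is illustrative only: it runs against an in-memory SQLite database rather than the real MediaWiki replicas, it assumes the archive columns are named ar_rev_id and ar_timestamp, and the helper names (build_demo_db, dedupe_archive) and sample rows are made up for the demo. It is not the actual sqoop query.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for the revision and archive tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revision (rev_id INTEGER, rev_timestamp TEXT)")
    conn.execute("CREATE TABLE archive (ar_rev_id INTEGER, ar_timestamp TEXT)")
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?)",
        [(1, "20200101000000"), (2, "20200102000000")],
    )
    conn.executemany(
        "INSERT INTO archive VALUES (?, ?)",
        [
            (1, "20190101000000"),  # same id, different timestamp: kept
            (2, "20200102000000"),  # same id and timestamp: duplicate, dropped
            (3, "20200103000000"),  # only in archive: kept
        ],
    )
    return conn


def dedupe_archive(conn):
    """Return archive rows that have no revision row with the same id and timestamp."""
    query = """
        SELECT ar.ar_rev_id, ar.ar_timestamp
        FROM archive AS ar
        LEFT JOIN revision AS rev
          ON rev.rev_id = ar.ar_rev_id
         AND rev.rev_timestamp = ar.ar_timestamp
        WHERE rev.rev_id IS NULL
        ORDER BY ar.ar_rev_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in dedupe_archive(build_demo_db()):
        print(row)
```

Whether a join like this is cheap enough to run inside the sqoop import, or is better applied afterwards when the history is built, is exactly the open question in the TODO above.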

Event Timeline

Nuria moved this task from Incoming to Dashiki on the Analytics board.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

This task took a wild ride through our board. I'm not sure what happened or why I deprioritized it, but it seems worth looking into to ensure the quality of the mw history dataset.

Milimetric raised the priority of this task from Low to Medium. Jul 20 2020, 3:49 PM
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
odimitrijevic lowered the priority of this task from Medium to Low. Jan 6 2022, 4:25 AM