MediaWiki content history dataset issues
Open, Needs Triage, Public

Description

The new content history datasets are a nice improvement. As can be expected, data quality issues have popped up. This meta-task collects these separate issues, both to give visibility to the need to fix them and to assess how they might impact the work of research scientists.

  • Duplicate rows for same page/revision id T410431
  • Redirects cannot be easily filtered T400632
  • Inconsistent formatting of page_title T410405
  • Duplicate page title in current T413888
  • Reconciliation accuracy T412461

The data issues risk decreasing trust in the content history dataset itself, which lowers confidence when sharing quick investigations or analyses and adds overhead in double-checking results to make sure they are not affected by known issues. This was not the case with the previous dumps1 mediawiki history, but that dataset was discontinued, so the content history is the only source of content available.
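
As an illustration of the kind of double-checking this implies, here is a minimal sketch of a check for the duplicate page/revision id issue (T410431). It assumes the rows have already been loaded as dictionaries; the column names `page_id` and `revision_id` and the helper `find_duplicate_revisions` are hypothetical and may not match the actual dataset schema.

```python
from collections import Counter


def find_duplicate_revisions(rows):
    """Return {(page_id, revision_id): count} for keys seen more than once.

    `rows` is any iterable of dict-like records. The column names are
    assumptions for illustration, not the confirmed dataset schema.
    """
    counts = Counter((row["page_id"], row["revision_id"]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows, made up for this sketch.
sample_rows = [
    {"page_id": 1, "revision_id": 100, "page_title": "Foo"},
    {"page_id": 1, "revision_id": 100, "page_title": "Foo"},
    {"page_id": 2, "revision_id": 200, "page_title": "Bar"},
]

duplicates = find_duplicate_revisions(sample_rows)
print(duplicates)
```

An empty result would suggest that a given extract is not affected by this particular issue; it says nothing about the other issues listed above.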

Event Timeline

Possibly related: are we missing (all?) page_delete events since we switched to DomainEvents?

T400380#11129969

Ottomata renamed this task from Content history dataset issues to MediaWiki content history dataset issues. Wed, Feb 11, 5:26 PM
Ottomata updated the task description.