The new content history datasets are a nice improvement. As can be expected, some data quality issues have popped up. This meta-task collects these separate issues, both to give visibility to the need to fix them and to assess how they might impact the work of research scientists.
- Duplicate rows for the same page/revision id T410431
- Redirects cannot be easily filtered T400632
- Inconsistent formatting of page_title T410405
- Duplicate page title in current T413888
- Reconciliation accuracy T412461
These data issues risk decreasing trust in the content history dataset itself, which lowers confidence when sharing quick investigations or analyses and adds overhead from double-checking results to make sure they are not affected by known issues. This was not the case with the previous dumps1 mediawiki history, but that dataset was discontinued, so the content history is now the only source of content available.
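
Until the issues are fixed, the double-checking mentioned above could look roughly like the sketch below for the duplicate-row case. It is only an illustration: it assumes rows have already been loaded as Python dicts, and the field names `page_id` and `revision_id` and the helper names are hypothetical, not taken from the actual dataset schema.

```python
from collections import Counter

# Hypothetical key fields; adjust to the real column names of the dataset.
KEY_FIELDS = ("page_id", "revision_id")


def find_duplicate_revisions(rows, key_fields=KEY_FIELDS):
    """Return a dict mapping each repeated key to the number of rows that carry it."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def drop_duplicate_revisions(rows, key_fields=KEY_FIELDS):
    """Keep only the first row seen for each key, preserving input order."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Made-up sample rows, for illustration only.
sample_rows = [
    {"page_id": 1, "revision_id": 100, "page_title": "Example"},
    {"page_id": 1, "revision_id": 100, "page_title": "Example"},
    {"page_id": 2, "revision_id": 200, "page_title": "Other"},
]

duplicates = find_duplicate_revisions(sample_rows)
deduplicated = drop_duplicate_revisions(sample_rows)
```

Keeping the first row per key is an arbitrary choice for this sketch; which duplicate is the correct one to keep depends on how the duplicates arise.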