
Add errata to the published "v1" scraper data
Closed, Resolved, Public


We've already found several different major flaws in the first published release of our scraper outputs. This task is finished once some minimal documentation of these issues is added to the figshare metadata. It can be as simple as a warning banner saying that the data is generally unusable and linking to some Phabricator tasks, or we can go as far as unpublishing the data entirely.

Some of the issues:

  • Many dumps were truncated (T345176).
  • Pages appeared multiple times, sometimes with different revision numbers (T354018).
  • Revisions were sometimes mixed, with wikitext and HTML coming from different versions of an article (T353321).
  • Reference similarity was overcounted when more than two refs shared content (T350145).

Event Timeline

awight moved this task from Sprint Backlog to Doing on the WMDE-TechWish-Sprint-2023-12-06 board.

We'll unpublish the data for now, and try to leave a note explaining our reasoning and where to watch for the v2 analyses.

awight moved this task from Doing to Done on the WMDE-TechWish-Sprint-2023-12-06 board.

Deprecation notes have been added to the metadata, the data files should be deleted from WMF hosting, and an index.html linking to this task has been copied to the empty directory.

awight claimed this task.