We've already found several major flaws in the first published release of our scraper outputs. This task is done once at least minimal documentation of these issues has been added to the figshare metadata. That can be as simple as a warning banner stating that the data is generally unusable and linking to the relevant Phabricator tasks, or we can go as far as unpublishing the data entirely.
Some of the issues:
- Many dumps were truncated (T345176).
- Pages appeared multiple times, sometimes with different revision numbers (T354018).
- Revisions were sometimes mixed, with wikitext and HTML coming from different versions of an article (T353321).
- Reference similarity was overcounted when more than two refs shared content (T350145).
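
For anyone who has already downloaded the release and wants to check whether a dump is affected by the duplicate-page issue, here is a minimal sketch. It assumes each record can be read as a dict with `page_id` and `rev_id` keys; those field names and the `find_duplicate_pages` helper are hypothetical and will likely need adapting to the actual output format.

```python
from collections import defaultdict


def find_duplicate_pages(records):
    """Return pages that occur more than once, with the revision ids seen.

    `records` is an iterable of dicts with the hypothetical keys
    "page_id" and "rev_id"; the real scraper output may name them differently.
    """
    revisions_by_page = defaultdict(list)
    for record in records:
        revisions_by_page[record["page_id"]].append(record["rev_id"])
    # Keep only pages that appear more than once in the dump.
    return {
        page_id: rev_ids
        for page_id, rev_ids in revisions_by_page.items()
        if len(rev_ids) > 1
    }


# Made-up sample records, for illustration only.
sample = [
    {"page_id": 1, "rev_id": 100},
    {"page_id": 2, "rev_id": 200},
    {"page_id": 1, "rev_id": 101},  # same page, different revision
    {"page_id": 3, "rev_id": 300},
    {"page_id": 3, "rev_id": 300},  # exact duplicate
]
duplicates = find_duplicate_pages(sample)
```

A page listed with differing revision ids corresponds to the "different revision numbers" case above. A page listed with identical ids is a plain repeat.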