
Stale data / missing pages in HTML ("enterprise") dumps
Open, Medium, Public, 3 Estimated Story Points, BUG REPORT

Description

  • Download enwiktionary HTML dump (April 1st, 2022)
  • Untar file
Stale data:
$ jq -r 'select(.name == "apreciable")' enwiktionary_*ndjson | head
{
  "name": "apreciable",
  "identifier": 2713698,
  "date_modified": "2021-03-19T05:53:16Z",
  "version": {
    "identifier": 62182446,
    "comment": "convert {{es-adj-old}} to new {{es-adj}} format",

What happens?:
The data returned is from March 2021. ("date_modified": "2021-03-19T05:53:16Z")

What should have happened instead?:
The data returned is from March 2022. (last edit 2022-03-09, diff)

Missing page:
$ jq -r 'select(.name == "paniaguarse")' enwiktionary_*ndjson
$

What happens?:
No output.

What should have happened instead?:
Data is returned for the page paniaguarse (created 2018-07-11)

There seem to be missing or outdated pages in all the recent (enwikt) HTML dumps I've tried. If it's useful, I can try to compile a list by diffing with the XML dump.
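
A minimal sketch of how such a list could be compiled, assuming the HTML dump is NDJSON with the `name` and `version.identifier` fields shown above. The `xml_revisions` mapping (page title to latest revision ID, extracted from the XML dump) and the function name `compare_dumps` are hypothetical, not part of any dump tooling. It also assumes both dumps cover the same namespace and that a lower revision ID means an older revision.

```python
import json
from typing import Dict, Iterable, List, Tuple


def compare_dumps(
    html_lines: Iterable[str],
    xml_revisions: Dict[str, int],
) -> Tuple[List[str], List[Tuple[str, int, int]]]:
    """Return (missing, stale) pages in the HTML dump relative to the XML dump.

    html_lines: lines of the HTML dump NDJSON, one JSON object per line.
    xml_revisions: hypothetical title -> latest revision ID map built
        from the XML dump (how it is built is left out of this sketch).
    """
    html_revisions: Dict[str, int] = {}
    for line in html_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        page = json.loads(line)
        name = page["name"]
        rev = page["version"]["identifier"]
        # If a title appears more than once, keep its highest revision ID.
        html_revisions[name] = max(rev, html_revisions.get(name, rev))

    # Titles present in the XML dump but absent from the HTML dump.
    missing = sorted(t for t in xml_revisions if t not in html_revisions)
    # Titles whose HTML revision is older than the XML revision:
    # (title, html_revision_id, xml_revision_id).
    stale = sorted(
        (t, html_revisions[t], xml_rev)
        for t, xml_rev in xml_revisions.items()
        if t in html_revisions and html_revisions[t] < xml_rev
    )
    return missing, stale


if __name__ == "__main__":
    # Illustrative records only; these are not real titles or revision IDs.
    sample = [
        '{"name": "foo", "version": {"identifier": 10}}',
        '{"name": "bar", "version": {"identifier": 5}}',
    ]
    missing, stale = compare_dumps(sample, {"foo": 10, "bar": 7, "baz": 3})
    print("missing:", missing)
    print("stale:", stale)
```

For a real dump, `html_lines` could be an open file handle over one of the `enwiktionary_*ndjson` files, so the dump is streamed line by line rather than loaded into memory.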

Event Timeline

Thanks for adding this @jberkel - we'll look into it!

Just a thought: perhaps the HTML dumps should be generated from the XML dumps, so that the revisions in both match (and they can both be used interchangeably without consistency problems).

Protsack.stephan triaged this task as Medium priority.
Protsack.stephan set the point value for this task to 3.
Protsack.stephan moved this task from In Progress to QA on the Wikimedia Enterprise board.
Lena.Milenko changed the task status from Open to In Progress. May 5 2022, 2:29 AM

I think this might be related to T274359.

Lena.Milenko changed the task status from In Progress to Open. Thu, Jun 30, 2:23 PM

Any updates on this? The task has been moved around a bit recently, but it's not clear what is happening. Is it difficult to fix?