
Latest English Wikipedia Wikimedia Enterprise HTML dumps do not seem to be updated
Closed, Resolved · Public · 5 Estimated Story Points · BUG REPORT

Description

Hi,

Thanks for making the English Wikipedia Wikimedia Enterprise HTML dumps available; they're a great resource for the community. Pardon me if I'm missing something obvious, but it seems like the latest dumps haven't been updated since June 14.

For example, if you take a look at enwiki-NS0-20220720-ENTERPRISE-STATS.json (link), it says:

{"wiki": "enwiki", "md5sum": "491fd5b65825240f52e5295260746b26", "date_modified": "2022-06-14T17:09:02.080073819Z"}

and if you look at enwiki-NS0-20220801-ENTERPRISE-STATS.json (link), you get the exact same output, down to the same md5sum.

Manually downloading the 07/20 dump and inspecting the last modified date on the enwiki*.ndjson files also shows that the last modified date is June 14.
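For reference, the identical metadata can be checked programmatically. A minimal sketch, using the JSON payloads quoted above rather than fetching the files over the network (the variable names are illustrative, not part of the dump tooling):

```python
import json

# STATS payloads as observed in the two dump listings
# (enwiki-NS0-20220720 and enwiki-NS0-20220801); both contained
# the identical content below.
stats_0720 = '{"wiki": "enwiki", "md5sum": "491fd5b65825240f52e5295260746b26", "date_modified": "2022-06-14T17:09:02.080073819Z"}'
stats_0801 = '{"wiki": "enwiki", "md5sum": "491fd5b65825240f52e5295260746b26", "date_modified": "2022-06-14T17:09:02.080073819Z"}'

a, b = json.loads(stats_0720), json.loads(stats_0801)

# If the two dump runs were independent, md5sum and date_modified
# should differ between the snapshots; here they do not.
print("same md5sum:", a["md5sum"] == b["md5sum"])
print("same date_modified:", a["date_modified"] == b["date_modified"])
```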

Thus, it seems these latest dumps aren't being generated properly / don't reflect the state of enwiki at the time of dumping. Would it be possible to get this fixed (and perhaps also get corrected dumps for these previous dates, if possible)?

Thanks in advance!

Event Timeline

@nfliu Unfortunately, the HTML dumps don't seem to be very reliable at the moment.

Hey, thanks for flagging this. Will review and fix. It won't be possible to fix this retroactively for past dumps, but it will be fixed for the new mirror downloads.

Protsack.stephan triaged this task as High priority.
Protsack.stephan set the point value for this task to 5.
Protsack.stephan moved this task from Incoming to In Progress on the Wikimedia Enterprise board.
Lena.Milenko changed the task status from Open to In Progress. Sep 22 2022, 1:46 PM

This one should be fixed. The next downloads for public mirroring should include the latest dump and metadata.

@Protsack.stephan great! However, it looks like the October dumps haven't been generated yet?

Lena.Milenko changed the task status from In Progress to Open. Oct 3 2022, 12:56 AM

The stats now have a correct timestamp, but there's still missing data. Can you please fix this? With this unpredictable mix of old and new data, the dumps are useless for most purposes right now; you might as well not generate them at all.

Will trigger the pipeline to regenerate the dataset. That should help with data consistency.