En Wikipedia stub dumps short for April 2016
Closed, ResolvedPublic

Description

The en stubs are shorter than usual and missing content has been reported. Investigate; if a single page range is missing, dump it as an extra stub and do all the follow-up steps as well. Otherwise, determine what other remedial steps can be taken.

Event Timeline

ArielGlenn created this task.

The pages-articles stubs for parts 23 through 27 are missing a lot of data. These need to be re-run. No indication in the logs of any issue. Next up is to check all stubs for meta-current and meta-history.

Note that although some of the earlier stub files in the sequence, i.e. 14 through 19, are a bit smaller than the corresponding files in March, this is primarily a consequence of image deletion or other such cleanup.
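For anyone who wants to repeat this kind of month-over-month size check, here is a minimal sketch. It is not the script used for this investigation; the function name, the 50% threshold, and the file names and sizes in the usage lines are all hypothetical and only illustrate the idea of flagging files that shrank far more than ordinary cleanup would explain.

```python
def flag_shrunken(previous, current, threshold=0.5):
    """Return names of files that are missing from `current` or whose size
    dropped below `threshold` times their size in `previous`.

    Both arguments map file name -> size in bytes. The threshold is an
    assumption; small shrinkage (e.g. from page deletions) is not flagged.
    """
    flagged = []
    for name, old_size in sorted(previous.items()):
        new_size = current.get(name)
        if new_size is None or new_size < threshold * old_size:
            flagged.append(name)
    return flagged


# Hypothetical sizes, for illustration only (not real dump data).
march = {"stub-articles19.xml.gz": 1000, "stub-articles23.xml.gz": 1000}
april = {"stub-articles19.xml.gz": 950, "stub-articles23.xml.gz": 120}
print(flag_shrunken(march, april))
```

With these made-up numbers only the part 23 file is reported, since part 19 shrank by just 5%.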

Just verified that the same holds true for the meta-current and meta-history stubs. Since these are all generated at the same time, this is to be expected.

These five jobs are now running manually in a screen session on snapshot1005. Once they are completed I will run the recombine steps and make them available for download immediately. After that I'll rerun the page content dumps for these stubs, doing multiple checkpoint files at once so we can get them done faster. We can do this since nothing else is running and there are plenty of spare CPU and disk cycles. Last will be the pages-articles recombine and the multi-stream bz2 page content dump.

Stub files, including the recombined file, are ready. Page content next.

Page content jobs are now running manually in screen sessions on snapshot100[5-7]. These ought to be done in a day or a day and a half, both bz2 and 7z files.

Bz2 files for all page content dumps for en wikipedia are now available. 7z recompression is running manually, as well as the pages-articles and pages-meta-current recombines.

All dumps have completed. The status update job is running now; once it finishes, the dump will be marked as successful and all files will be available for download.

Aaaand I did not remember to run the rest of the 7z jobs of course. Doing that now.

Thanks! I pulled down the latest dump and it looks good. The Module count is up to 3163 and individual spot-checking looks good.
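A per-namespace page count like the Module figure above can be reproduced roughly as follows. This is only a sketch, not the tooling actually used here: it assumes the standard MediaWiki XML export layout, where each `<page>` element has an `<ns>` child, and that the Module namespace has number 828. The helper names and the tiny inline sample are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_pages_by_ns(stream):
    """Count <page> elements per namespace number in an export XML stream."""
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) == "page":
            for child in elem:
                if local(child.tag) == "ns":
                    counts[int(child.text)] += 1
                    break
            # Drop the page's children so memory stays modest on large files.
            elem.clear()
    return counts


# Tiny made-up sample standing in for a real stub file.
sample = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>A</title><ns>0</ns><id>1</id></page>"
    b"<page><title>Module:X</title><ns>828</ns><id>2</id></page>"
    b"</mediawiki>"
)
counts = count_pages_by_ns(io.BytesIO(sample))
print(counts[828])
```

For a real compressed dump, the stream could be opened with `gzip.open(path, "rb")` or `bz2.open(path, "rb")` instead of the in-memory sample.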

I'll do some more processing later this weekend and post a follow-up to the mailing list when done.

Thanks again!

The 7z job completed and the status update is done as well. The dates on a couple of the steps are wrong but the files should all be good to go.

The latest dump looks good. I'm running it through my XOWA parser now, and have not seen any issues. I'll post a quick message to the mailing list now. Thanks again for the follow-up!

[EDIT: actually, going to hold off on the message until confirmation from Marcus]

Yup. I think we're good. Thanks again for looking into it. Let me know if there is anything else (should I be the one to mark it resolved?)

Nope, I'm closing it right now. Thanks!