Page MenuHomePhabricator

wikidata page dumps very slow this month
Closed, ResolvedPublic

Description

It's taken from Dec 8 til today to generate the files that ordinarily would have been done in two days. Need to see what's up.

Event Timeline

I had a few theories to check:

  1. Are there more revisions this month all of a sudden? Possible, but a check of revision count in stubs-meta-history-1 for Nov and Dec shows an increase of only 89180436 to 90827879. Ruled out.
  2. HW/nfs/disk issues? I checked atop on both dataset1001 and snapshot1006, dmesg, syslog, etc. Nothing there.
  3. Prefetch issues? Are we not finding the revision texts on disk for some reason? TRUE: but only for stubs-1. stubs parts 2,3,4 all ran fine with 98 to 99% rev texts prefetched.

Great, so it's a prefetch issue. This rules out things like db issues.

From Dec 8 through Dec 15 at 01:10:57 we had a steady 0.0% revs prefetched. From Dec 15 01:10:57 we suddenly saw prefetches happening and the percentage steadily increased from then. I have checked all logs including SAL, puppet, etc and my own work for anything that might have happened around then: a big fat zero.

I compared the first 200k lines of stubs-1 from Nov and Dec ful1 runs, they are mostly comparable except for pages that get removed. Order of revisions seems the same. Checked the contents of a number of entries to be sure. Same sha1s, everything. Did a raw diff, about 8k lines out of 200k differ, which is to be expected.

I looked at the first part of the revision content output for November corresponding to stubs-1. Looks sane, I'll do more in depth checking shortly.

In order to play cachup I have three jobs running out of a root screen session on snapshot1005 that are generating the missing files from stubs-meta-history-1. Sometime tomorrow I'll shoot the regular process, run one more missing piece and then start up the 7z job.

Note that 2 of the jobs are at 98% prefetch and the third at 88%, leaving me at a loss to understand what the initial problem was.

I'll try rerunning a part of the first checkpoint file later to see if it still gives 0% prefetch or if it magically "works".

Found it! It was the sort order of the prefetch files, broken when converting from .sort() to sorted() and fixed just yesterday. Resolving this ticket. GOOD GRIEF