
frwiki dump from 20181020 contains incomplete data
Closed, Resolved (Public)

Description

Many files in https://dumps.wikimedia.org/frwiki/20181020/ look too small. For instance, pages-meta-current shrank by 38% compared to the previous dump:
frwiki-20181001-pages-meta-current.xml.bz2 5.5 GB
frwiki-20181020-pages-meta-current.xml.bz2 3.4 GB
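
For reference, the 38% figure follows directly from the two sizes listed above. A minimal sketch of the arithmetic (the function name `percent_decrease` is only illustrative, not part of any dumps tooling):

```python
def percent_decrease(old_size: float, new_size: float) -> float:
    """Return how much smaller new_size is than old_size, as a percentage."""
    if old_size <= 0:
        raise ValueError("old_size must be positive")
    return (old_size - new_size) / old_size * 100.0


# Sizes in GB, taken from the two file listings above.
drop = percent_decrease(5.5, 3.4)
print(f"pages-meta-current shrank by about {drop:.0f}%")
```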

Event Timeline

Some of the stub dumps have incomplete data, almost certainly due to T207628. I'll audit all stub dumps and rerun any that are short, along with the recombine steps and the various page content dumps.

I've checked all wikis that have fewer pages in stubs from the Oct 20 run than the Oct 1 run. The following wikis need reruns of some jobs: frwiki, trwiki, arwiki.
I'll determine which stub files are incomplete, rerun those and then rerun the steps dependent on those new files.
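
The check described above can be sketched roughly as follows. This is not the actual audit script; it assumes the stub XML is available as an iterable of text lines with each `<page>` opening tag on its own line, and the helper names are hypothetical. A wiki flagged this way is only a candidate, since a lower page count alone does not prove a file is incomplete.

```python
from typing import Dict, Iterable, List


def count_pages(lines: Iterable[str]) -> int:
    """Count <page> opening tags in a stub dump read line by line.

    Assumption: each <page> tag sits on its own line.
    """
    return sum(1 for line in lines if line.strip() == "<page>")


def wikis_with_fewer_pages(old_counts: Dict[str, int],
                           new_counts: Dict[str, int]) -> List[str]:
    """Return wikis whose newer run has fewer pages than the older run."""
    return sorted(
        wiki
        for wiki, old in old_counts.items()
        if wiki in new_counts and new_counts[wiki] < old
    )


# Placeholder counts for illustration only; these are not real dump figures.
oct_01 = {"examplewiki_a": 100, "examplewiki_b": 200}
oct_20 = {"examplewiki_a": 100, "examplewiki_b": 150}
print(wikis_with_fewer_pages(oct_01, oct_20))
```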

Jobs and parts of jobs that need to be rerun:

  • frwiki: stubs articles, meta-current, meta-history 3 and 6; pages-articles, pages-meta-current 3 and 6; stubs recombine; pages-articles recombine; pages-meta-current recombine; multistream-bz2
  • arwiki: stubs articles, meta-current, meta-history 4; pages-articles, pages-meta-current 4; stubs recombine; pages-articles recombine; pages-meta-current recombine; multistream-bz2
  • trwiki: stubs articles, meta-current, meta-history (entire file); pages-articles, pages-meta-current (entire file); multistream-bz2
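
The lists above follow a pattern: once a stubs part is bad, every step that consumes its output has to be rerun too. A rough sketch of that propagation is below. The dependency edges are an assumption inferred from the job lists in this task, not taken from the dumps scheduler, and `jobs_to_rerun` is a hypothetical helper.

```python
from collections import deque
from typing import Dict, Iterable, List

# Assumed edges: job -> jobs that consume its output.
# Inferred from the rerun lists above, not from the real scheduler config.
DEPENDENTS: Dict[str, List[str]] = {
    "stubs": ["stubs recombine", "pages-articles", "pages-meta-current"],
    "pages-articles": ["pages-articles recombine"],
    "pages-meta-current": ["pages-meta-current recombine"],
    "pages-articles recombine": ["multistream-bz2"],
}


def jobs_to_rerun(bad_jobs: Iterable[str]) -> List[str]:
    """Return the bad jobs plus everything downstream, in breadth-first order."""
    ordered: List[str] = []
    seen = set()
    queue = deque(bad_jobs)
    while queue:
        job = queue.popleft()
        if job in seen:
            continue
        seen.add(job)
        ordered.append(job)
        queue.extend(DEPENDENTS.get(job, []))
    return ordered


for job in jobs_to_rerun(["stubs"]):
    print(job)
```

For a wiki dumped as a single file, such as trwiki above, the recombine steps would simply not exist, so a real mapping would need to vary per wiki.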

I believe all the bad files have been moved out of the way and status files updated with failure for the specific jobs; I have rescheduled the dumps cron and it has restarted on snapshot1009. I'll be keeping an eye on it to make sure all the files and jobs we want are rerun.

All three wikis seem to have rerun the corresponding stubs jobs and are progressing slowly through the rest of the steps. I'll be checking on them this weekend, and double-checking to be sure that all the content is there too.

ArielGlenn claimed this task.

All three wikis have completed their runs and the files look complete. Closing this ticket.