
Apr 1 2019 and/or >=1.33-wmf.23 dump run issues
Open · Normal · Public · 0 Story Points

Description

Because there have been a number of issues with this run and with branches 1.33-wmf.23 or later, I'm collecting them all here so I can manage them better.

  • T217329 bug in 1.33.0-wmf.18 breaks abstract dumps on testwikidatawiki | MWContentSerializationException $entityId and $targetId can not be the same
  • T220160 getRedirectTarget should not automatically load revision content in all cases
  • T220316 XmlDumpWriter::openPage handles main namespace articles with prefixes that are namespace names AND are redirects incorrectly
  • T220424 XmlDumpWriter::writeRevision sometimes broken by duplicate keys in Link Cache
  • T220493 Xml stubs dumps are running 5 to 15x slower than previously
  • T220257 dumpBackups.php failing with InvalidArgumentException thrown from RevisionStoreRecord for certain wikis
  • T220594 abstracts dumps for dewikiversity fail with MWUnknownContentModelException from ContentHandler.php
  • T220793 content still marked as flow-board on urwikibooks breaks abstract dumps
  • T220940 Abstracts dumps for Commons running very slowly

An X next to the task means all the needed dumps patches have been merged and deployed to all wikis. (coming shortly)

These can be summarized as the following:

  • stubs run much slower
  • abstracts run much slower
  • pages with the same key in the link cache now cause fatals during stubs/abstract dumps
  • wikibase entities which are self-redirects now cause fatals during abstract dumps
  • revisions that are left marked flow-board on wikis which no longer have flow enabled, now cause fatals during stubs/abstract dumps
  • attempts to use revisions from a second page for a first page's dump now cause fatals during stubs/abstract dumps
  • bad text table entries (typically with DB://cluster20/0) now cause fatals during stubs/abstract dumps

In most cases we need to resolve the issue at two levels: fixing dumps to be more resilient, and fixing the underlying problem (bad revisions, duplicate link cache keys, etc).

Event Timeline

ArielGlenn triaged this task as Normal priority. Apr 16 2019, 1:40 PM
ArielGlenn created this task.

I might as well note here things being done to patch up the Apr 1 2019 run as well.

For wikidata I am running the 7zs manually into a separate directory, out of a screen session on snapshot1009. That host is otherwise idle since enwiki finished up; I ran a noop on it after forcing a manual run of abstract5 with the live patches.
I am keeping an eye on commons; if the history run takes too long I'll do 7zs manually for that too.
I think it likely that everything else will complete on time, with the possible exception of the viwiki and commons abstracts; I am working on a dirty hack to get those done (which will also require manual intervention, due to the speed issues).

The two wikis left to complete are commons and wikidata. I'll be trying to get the commons abstracts to complete today.

Running commons abstract piece 5 into a separate directory now manually on snapshot1005, with https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/504792/ and https://gerrit.wikimedia.org/r/#/c/operations/dumps/+/504842/ applied locally to 1.34_wmf1 and 1.33_wmf25.

Running commons flow dumps manually in another window.

The commons abstract part 5 is done already. Running the commons flow history dump manually; after that will be the multistream dumps, and then I'll run the abstracts recombine and make all the abstracts available.

For wikidata we are simply waiting for the bz2 history dumps to complete, and I'm running a 7z recompress job periodically to convert any new bz2 files produced.
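The periodic recompress job boils down to scanning the dump output directory for bz2 files that don't yet have a 7z counterpart and recompressing just those. A minimal sketch of that logic (hypothetical paths and helper names; the real dumps scripts differ, and the `7za` pipeline assumes p7zip is installed):

```python
import os
import subprocess


def bz2_files_needing_7z(dump_dir):
    """Return paths of .bz2 files in dump_dir with no .7z counterpart yet."""
    names = set(os.listdir(dump_dir))
    return sorted(
        os.path.join(dump_dir, name)
        for name in names
        if name.endswith(".bz2")
        and name[: -len(".bz2")] + ".7z" not in names
    )


def recompress_to_7z(bz2_path):
    """Recompress one bz2 file to 7z, roughly: bzcat file.bz2 | 7za a -si file.7z"""
    out_path = bz2_path[: -len(".bz2")] + ".7z"
    bzcat = subprocess.Popen(["bzcat", bz2_path], stdout=subprocess.PIPE)
    # 7za reads the decompressed stream from stdin via -si
    subprocess.check_call(["7za", "a", "-si", out_path], stdin=bzcat.stdout)
    bzcat.stdout.close()
    if bzcat.wait() != 0:
        raise RuntimeError("bzcat failed for %s" % bz2_path)
```

Run periodically (e.g. from cron or a loop in screen), this picks up each newly completed bz2 file exactly once, since the finished .7z suppresses reselection on the next pass.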

Commons is done and should be available soon on the webserver. Wikidata is still finishing up page content bz2 files; that and the 7z recompressed files are the last for this run.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Apr 20 2019, 7:34 AM

I'm delaying the start of the wikidata dumps for a few hours, to give the manual generation of 7z files time to complete. A four-hour delay ought to be enough.

The noop job is running for wikidata, which will generate checksums and update links; there's also one running for enwiki because of my typo. No harm done, except that the RSS feed might look a bit odd until later in the day.

Sometime later today or at worst tomorrow the wikidata files should be available for download.

ArielGlenn updated the task description. Apr 21 2019, 7:14 AM
ArielGlenn updated the task description. Apr 21 2019, 9:18 AM
ArielGlenn updated the task description. Apr 21 2019, 3:27 PM
ArielGlenn updated the task description. Jul 3 2019, 6:04 AM