
Wikidata not reachable - MediaWiki internal error
Closed, Resolved (Public)

Description

Loading any Wikidata page, I get the following:

MediaWiki internal error.

Original exception: [d085508f-d496-414d-8dcd-3e83e828d741] 2021-07-21 16:30:07: Fatal exception of type "MWException"

Exception caught inside exception handler.

Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information.

The hash of the exception changes each time the page is reloaded.

Event Timeline

And now it seems solved; it lasted about two or three minutes.

Esc3300 raised the priority of this task from High to Unbreak Now! (Jul 21 2021, 4:33 PM)
Epidosis claimed this task.

The error rate on appservers was up for 5 minutes but recovered right after the deploy was reverted.

Yup, mea culpa. I tried to sync the T287085 backport in two steps: first WikibaseLexeme.entitytypes.php, where an entity type definition field was added, and then WikibaseLexeme.entitytypes.repo.php, where the corresponding field was removed. I believed that this would be the best way to deploy the change, because it would be okay to have the field in both files (and, deployed in this order, there would be no point in time where the field was missing from both files), but clearly I was wrong: instead, having the field in both files produced an instantaneous hard crash, which blocked the scap thanks to the canaries. At this point, the canaries were broken, and I wanted to revert the change (because canaries serve user traffic, meaning this affected a certain amount of users) before investigating more.

I tried to fix this by reverting the Git submodule update in php-1.37.0-wmf.15, running git submodule update, and then syncing the same path again, to restore the same file. This sync went through – probably because the canaries already had a high error rate, so the error rate didn’t rise – but it turns out that the git submodule update had not removed the commit from the WikibaseLexeme submodule, and so now the broken file was deployed to all of production, not just the canary servers.

I realized what had happened when error reports came in (e.g. on Telegram – I didn’t see this one until after the fact) and I looked at the git log for the WikibaseLexeme submodule, seeing that the change was still there. At this point, I could have tried again to revert the change properly (git checkout @^ in that directory, instead of trying to do anything via git submodule). But I realized that I might as well sync the second file now: the first file had reached all the hosts, and I knew that the state with both files synced was okay (I had tested that on mwdebug, after all), so syncing the second file (with the backport) would be just as good as syncing the first file (to the previous change), while also actually deploying the fix that I’d wanted to deploy in the first place. So I synced the second file, held my breath, and it went through, Wikidata recovered, and errors went down again.

My takeaways:

  • If you’re planning to sync in several steps (here: first one file, then the other), and you’re not absolutely confident that it will work in the state between both steps, test that as well, not just the state where both files are synced. But frankly, this sync didn’t feel very risky to me before. Not sure if this takeaway helps much.
  • The bigger one: to revert, run Git commands that you’re sure will actually revert the change (this would’ve been either a git revert or a git checkout @^ in the WikibaseLexeme submodule, not some submodule thing in the wmf.15 repo), and verify that they’ve worked, before syncing. In particular: when you’re syncing to repair broken canaries, you cannot rely on the canaries to protect you from further mistakes, so be extra careful.
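
One possible way to do the “verify that they’ve worked” step is to ask Git whether the bad commit is still reachable from HEAD before syncing. This is only a sketch: commit_still_checked_out is a hypothetical helper, not part of scap, and it assumes Python and git are available wherever the check is run. It relies on git merge-base --is-ancestor, which exits with 0 if the first commit is an ancestor of the second and with 1 if it is not.

```python
import subprocess


def commit_still_checked_out(repo_dir: str, commit: str) -> bool:
    """Return True if `commit` is still part of HEAD's history in repo_dir.

    Hypothetical helper for illustration; run it against the submodule
    directory (not the parent repo) after attempting a revert.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", commit, "HEAD"],
        check=False,
        capture_output=True,
        text=True,
    )
    # Exit status 0: commit is an ancestor of HEAD; 1: it is not.
    # Anything else (e.g. an unknown commit) is a real error.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip())
    return result.returncode == 0
```

If this still returns True for the commit you meant to undo, the working tree has not actually moved away from the broken change, and syncing would deploy it again.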

By the way, the error was

Bad value for parameter $prefetchingTermLookupCallbacks: all elements must be callable

and, without having checked this, I’m guessing the reason for it was that we array_merge_recursive() those definition files, and so instead of one callable overriding the other, as I had expected, they were probably merged into an array of two callables.
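
To illustrate that guess, here is a rough Python model of how PHP’s array_merge_recursive() treats string keys that appear in both inputs: instead of the later value replacing the earlier one, the two values are collected into a list. This is not the Wikibase code. The field name, the two lookup functions and the merge_recursive helper are all made up for the example, and the model only covers string-keyed arrays. It checks the three states of the two-step sync (before, between, after).

```python
def merge_recursive(*arrays):
    """Rough model of PHP array_merge_recursive() for string-keyed arrays."""
    merged = {}
    for array in arrays:
        for key, value in array.items():
            if key not in merged:
                merged[key] = value
            elif isinstance(merged[key], dict) and isinstance(value, dict):
                # Nested arrays under the same key are merged recursively.
                merged[key] = merge_recursive(merged[key], value)
            else:
                # Colliding non-array values are collected into one list
                # rather than the later value overriding the earlier one.
                old = merged[key] if isinstance(merged[key], list) else [merged[key]]
                new = value if isinstance(value, list) else [value]
                merged[key] = old + new
    return merged


FIELD = "prefetching-term-lookup-callback"  # hypothetical field name


def lookup_from_base():  # stands in for the callback in entitytypes.php
    return "base"


def lookup_from_repo():  # stands in for the callback in entitytypes.repo.php
    return "repo"


with_base = {"lexeme": {FIELD: lookup_from_base}}
with_repo = {"lexeme": {FIELD: lookup_from_repo}}
without = {"lexeme": {}}

# (base file, repo file) at each point of the two-step sync
states = {
    "before": (without, with_repo),
    "between": (with_base, with_repo),
    "after": (with_base, without),
}

results = {
    name: callable(merge_recursive(base, repo)["lexeme"][FIELD])
    for name, (base, repo) in states.items()
}
print(results)
```

In this model only the “between” state ends up with a non-callable value (a list of two callables), which would match the “all elements must be callable” error.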

I guess another takeaway would be that scap should be able to rescue the canaries by itself, which was already identified as a possible followup from another incident: T225207: Enable scap to roll back broken changes to MediaWiki

It sounds like this might be planned for the new deployment workflow: T287046: scap backport --revert command

Very nice incident report 😃 Happy to see it was a learning experience.