Page MenuHomePhabricator

Fix ParserOutput corruption wmf.10 -> wmf.11
Closed, ResolvedPublicPRODUCTION ERROR

Description

We're seeing a 100-1000 times increase of errors per minute across wikis and servers due to what appears to be invalid CacheTime objects stored in ParserCache by wmf.11 code,. We have rolled back to wmf.10, and the above patch has invalidated the affected cache entries.

Link to example errors in Kibana:
https://logstash.wikimedia.org/goto/e4ff1589fd26334c21e663a6be6ba1be

Each page view seems to experience a cascading chain of about ~10 different PHP errors/warnings:

PHP Warning: array_intersect(): Argument #1 is not an array
#1 /srv/mediawiki/php-1.36.0-wmf.10/includes/parser/ParserOptions.php(1397): array_intersect(NULL, array, array)
#2 /srv/mediawiki/php-1.36.0-wmf.10/includes/parser/ParserCache.php(251): ParserOptions->optionsHash(NULL, Title)
#3 /srv/mediawiki/php-1.36.0-wmf.10/includes/parser/ParserCache.php(283): ParserCache->getKey(WikiPage, ParserOptions, integer)
#4 /srv/mediawiki/php-1.36.0-wmf.10/includes/page/Article.php(733): ParserCache->get(WikiPage, ParserOptions)

This appears to be the first one in the chain, it seems CacheTime->mUsedOptions might be null or unset?

Lots of cascading due to the required $forOptions array parameter being null, which are causing wrong cache keys to be used for reading or writing the cache:

PHP Warning: in_array() expects parameter 2 to be array, null given
/srv/mediawiki/php-1.36.0-wmf.10/extensions/Math/src/MathHooks.php:116      
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.36.0-wmf.10/extensions/Math/src/MathHooks.php(116): in_array(string, NULL)
#2 /srv/mediawiki/php-1.36.0-wmf.10/includes/HookContainer/HookContainer.php(331): MathHooks::onPageRenderingHash(string, User, NULL)
#3 /srv/mediawiki/php-1.36.0-wmf.10/includes/HookContainer/HookContainer.php(138):

Event Timeline

Change 631318 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] Reject ParserCache entries from the last wmf.11 deployment

https://gerrit.wikimedia.org/r/631318

Change 631318 merged by jenkins-bot:
[operations/mediawiki-config@master] Reject ParserCache entries from the last wmf.11 deployment

https://gerrit.wikimedia.org/r/631318

My fault. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/629405 has to be reverted, which would make it a 'revert revert revert'

Change 631239 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Revert "Revert "Revert "Hard deprecate all public properties in CacheTime and ParserOutput"""

https://gerrit.wikimedia.org/r/631239

Change 631324 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.36.0-wmf.10] HACK/ParserCache: Forche cache-miss if mUsedOptions is undefined

https://gerrit.wikimedia.org/r/631324

Change 631324 merged by Krinkle:
[mediawiki/core@wmf/1.36.0-wmf.10] HACK/ParserCache: Force cache-miss if mUsedOptions is undefined

https://gerrit.wikimedia.org/r/631324

Change 631240 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@wmf/1.36.0-wmf.11] Revert "Revert "Revert "Hard deprecate all public properties in CacheTime and ParserOutput"""

https://gerrit.wikimedia.org/r/631240

This is all clearly may fault. My apologies, and thank you to @Krinkle and everyone involved for dealing with the mess I've caused.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631239 reverts the faulty patch in master,
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631240 is a backport to wmf.11

wmf.11 is completely rolled back, hopefully it can roll out tomorrow but I'm not too confident. If @hashar wants to try it during EU window that's up to him ;)

This is not the first time a bug I've noticed (and reported) in translatewiki.net also causes issues in WMF production. Some of them I mark as train blockers when it is obvious they will affect WMF production too, but some issues like this are not so obvious. Any ideas how to improve this process so that this won't happen again?

wmf.11 is completely rolled back, hopefully it can roll out tomorrow but I'm not too confident. If @hashar wants to try it during EU window that's up to him ;)

I don't feel very confident to roll it without assistance from people knowing about the issue / ParserCache being available. I will look at being available in my evening to assist though.

Is the issue due to unserializing a ParserOutput object that got serialized based on a different version of the class? If so that has hit us several time in the past. At least T229366 and we had some test to check the unserialization: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/526575/2/tests/phpunit/includes/parser/ParserOutputTest.php

If that is the same problem, maybe we would need a specific test that runs whenever ParserOutput is changed and ensure that we can unserialize from a previous version. Or at least that the cache key is altered.

So, as I understand it, the issue is that my patch has moved a property up the class hierarchy, and I've added fallbacks for reading old ParserOutput with new version of the code, tested that path, added a unit test, all that. However, I have completely forgot about rolling back, and making sure the old version of the class can read the new serialized data. Rookie mistake.

As for making sure this doesn't happen again: T263579 should help - as a part of Parsoid integration we are going to change the serialization format to JSON. We should make sure to include more explicit upgrade/downgrade logic in JSON serialization since we will have much more control over it.

As for moving the train forward, complete revert of the faulty patch (631240) should be enough (but not of Timo's hacks yet. Those will need to be reverted later when we're sure there's no bad ParserOutput in the cache, I'll do that). However at this point I do not have enough self-confidence to make any definitive statements.

I too am pretty uncertain about wmf.11, there were actually multiple issues that cropped up yesterday and it was hard to tell which ones were real because we had a lot of nonsensical / seemingly impossible errors that were caused by opcode cache corruption. The corruption errors were obscuring the other issues for a while.

@Pchelolo Can you put the various patches that need to be rolled out to unblock this on the deploy calendar and/or coordinate with releng?

Krinkle triaged this task as Unbreak Now! priority.Oct 1 2020, 6:02 PM

(Train blocker)

It has been reverted in master, wmf.11 backport needed to unblock is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631240

Change 631239 merged by jenkins-bot:
[mediawiki/core@master] Revert "Revert "Revert "Hard deprecate all public properties in CacheTime and ParserOutput"""

https://gerrit.wikimedia.org/r/631239

Change 631471 had a related patch set uploaded (by Ppchelko; owner: Krinkle):
[mediawiki/core@master] HACK/ParserCache: Force cache-miss if mUsedOptions is undefined

https://gerrit.wikimedia.org/r/631471

Change 631473 had a related patch set uploaded (by Ppchelko; owner: Krinkle):
[mediawiki/core@wmf/1.36.0-wmf.11] HACK/ParserCache: Force cache-miss if mUsedOptions is undefined

https://gerrit.wikimedia.org/r/631473

Change 631473 abandoned by Ppchelko:
[mediawiki/core@wmf/1.36.0-wmf.11] HACK/ParserCache: Force cache-miss if mUsedOptions is undefined

Reason:
created another one.

https://gerrit.wikimedia.org/r/631473

Change 631473 restored by Ppchelko:
[mediawiki/core@wmf/1.36.0-wmf.11] HACK/ParserCache: Force cache-miss if mUsedOptions is undefined

https://gerrit.wikimedia.org/r/631473

For the ParserCache blocker, this is the chain of back ports that moves wmf.11 to the same exact state as wmf.10 and master for the parserCache: They are to be merged before deployment.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631240/2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631473/2

@Pchelolo: Should we close this or leave it open? (not sure if there is remaining work besides merging the aforementioned patches?)

This should be resolved now yes. I'm going to create a couple followup tasks and will cross reference them to here.

Change 631240 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.11] Revert "Revert "Revert "Hard deprecate all public properties in CacheTime and ParserOutput"""

https://gerrit.wikimedia.org/r/631240

Change 631473 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.11] HACK/ParserCache: Force cache-miss if mUsedOptions is undefined

https://gerrit.wikimedia.org/r/631473

Mentioned in SAL (#wikimedia-operations) [2020-10-01T20:08:13Z] <twentyafterfour@deploy1001> Synchronized php-1.36.0-wmf.11/includes/parser/: sync ParserCache patches to unblock the train T264257 T263177 (duration: 00m 59s)

Change 631471 merged by jenkins-bot:
[mediawiki/core@master] HACK/ParserCache: Force cache-miss if mUsedOptions is undefined

https://gerrit.wikimedia.org/r/631471