Page MenuHomePhabricator

PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Problem

From T221347#5122170 and T221347#5122194:

On 2019-04-18 from 10:00 UTC (possibly earlier), the code in mediawiki/Skin.php on the mw1274 server appeared to have changed in opcache. This lasted until 13:00 when the opcache was manually cleared.

Effective difference
  function getLogo() {
-     return $this->getConfig()->get( 'Logo' );
+     return $this->getConfig()->get( 'Kogo' );
  }
Original task description by @A2093064:
MediaWiki internal error.

Original exception: [XLhOBwpAAEUAAG7qWroAAABT] 2019-04-18 10:14:32: Fatal exception of type "ConfigException"

Exception caught inside exception handler.

Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information.

Event Timeline

I have also encountered this for times on zh.wikipedia ( When viewing page history and diff )

Also met this problem in 18:10 (GMT+8)

-42680d98730350f3.jpg (2×1 px, 132 KB)

message
ConfigException from line 53 of /srv/mediawiki/php-1.33.0-wmf.25/includes/config/GlobalVarConfig.php: GlobalVarConfig::get: undefined option: 'Kogo'
trace
#0 /srv/mediawiki/php-1.33.0-wmf.25/includes/skins/Skin.php(498): GlobalVarConfig->get(string)
#1 /srv/mediawiki/php-1.33.0-wmf.25/includes/skins/Skin.php(956): Skin->getLogo()
#2 /srv/mediawiki/php-1.33.0-wmf.25/includes/skins/SkinTemplate.php(383): Skin->logoText()
#3 /srv/mediawiki/php-1.33.0-wmf.25/includes/skins/SkinTemplate.php(228): SkinTemplate->prepareQuickTemplate()
#4 /srv/mediawiki/php-1.33.0-wmf.25/includes/OutputPage.php(2723): SkinTemplate->outputPage()
#5 /srv/mediawiki/php-1.33.0-wmf.25/includes/MediaWiki.php(872): OutputPage->output(boolean)
#6 /srv/mediawiki/php-1.33.0-wmf.25/includes/MediaWiki.php(884): MediaWiki->{closure}()
#7 /srv/mediawiki/php-1.33.0-wmf.25/includes/MediaWiki.php(515): MediaWiki->main()
#8 /srv/mediawiki/php-1.33.0-wmf.25/index.php(42): MediaWiki->run()
#9 /srv/mediawiki/w/index.php(3): require(string)
#10 {main}

Seen 928 times in the last 24 hours. I've also just faced it after opening my watchlist.

Although I found it on itwiki where we're always at 1.33-wmf.25...

Well, looking at the message and at this line of code, it seems like it used to read"Kogo" instead of "Logo", although there's no sign of this typo in wmf.25, nor in master.

This error appeared to me on ptwiki when I tried to delete some edit revisions

This is happening randomly when using the enwiki.

+1 comments don't help, just one report is enough :). cc @Gangster8192 and @MrFulano .

Indeed, I echo Urbanecm. Giving high priority because a refresh fixes it and the code already looks correct. If you think it should be UBN, feel free to escalate it.

Mentioned in SAL (#wikimedia-operations) [2019-04-18T12:36:23Z] <anomie> Ran php7adm /opcache-free on mw1274 to test a theory related to T221347. The log entries related to that task stopped immediately.

This error is also in Czech Wikipedia.

Anomie subscribed.

Looking into this, I found that all instances of the "Kogo" exception were on mw1274 and all were using php 7.2. Having seen https://wikitech.wikimedia.org/wiki/Debugging_in_production#PHP7_Opcache recently, I tried running php7adm /opcache-free on mw1274 and the log entries about this occurring immediately stopped.

I have no idea how "Kogo" might have gotten into the opcache in the first place. Adding a few relevant tags for further investigation into that.

(apologies for not submitting this comment earlier, external factors prevented it)

Hmm, L and K having a hamming distance of 3 - Could this possibly be a memory error that wasn't detectable by ECC as a 3 bit error?

Given that they're close on the keyboard, I originally thought of a typo. But that would mean someone edited the file and introduced the typo on mw1274. Could be something more subtle of course.

Krinkle renamed this task from Fatal exception of type "ConfigException" to Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException").Apr 19 2019, 7:15 PM
Krinkle updated the task description. (Show Details)

Shouldn't prio be lowered? If I'm not mistaken, we're in post mortem stage - which can last longer than fixing the error :-).

Yes, that's correct. However I'd encourage to investigate this issue now: you know, logs get garbage collected, people may forget what they did... Generally speaking, it could become hard or impossible to find tracks of this change in a few months. That being said, I checked again, and the first occurrence of "Kogo" on mw1274 is on 18/04 at 9:41 UTC. Nothing suspicious in the SAL at that time. Maybe it could be due to some unlogged change? The ConfigExceptions started at that time with a constant rate of ~10 per minute, so I'd expect the offending change to have happened no more than a couple of minutes before 9:41. But again, the only thing in the SAL around that time is a sync of InitialiseSettings which cannot have caused this issue.

I realized I failed to update this ticket with my investigation:

looking at the metrics we collect we exhausted the opcache string memory quite consistently. I've uploaded and merged a patch this morning that (among other things) bumps that cache to 50 MB from 8 MB.

While this hasn't been applied everywhere for now, it looks like a viable suspect for such a weird behaviour.

Btw, logs don't tell you much in the case you meet such a bug - I would've had to go scavenge a core dump of the running process to try to figure out what was actually going on there. I think the priority should be lowered and we should look at reoccurrences of the same bug after today's change has been widely applied.

Btw, logs don't tell you much in the case you meet such a bug - I would've had to go scavenge a core dump of the running process to try to figure out what was actually going on there. I think the priority should be lowered and we should look at reoccurrences of the same bug after today's change has been widely applied.

Does that mean we are aware of _other_ occurances of similar bug? Unless I'm blind, there's no comment mentioning that above.

Btw, logs don't tell you much in the case you meet such a bug - I would've had to go scavenge a core dump of the running process to try to figure out what was actually going on there. I think the priority should be lowered and we should look at reoccurrences of the same bug after today's change has been widely applied.

Does that mean we are aware of _other_ occurances of similar bug? Unless I'm blind, there's no comment mentioning that above.

No, I just said that if this doesn't happen again, we can have some confidence my change did the trick.

Change 508590 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] php-fpm: revalidate opcache every 60 seconds.

https://gerrit.wikimedia.org/r/508590

Change 508590 merged by Giuseppe Lavagetto:
[operations/puppet@production] php-fpm: revalidate opcache every 60 seconds.

https://gerrit.wikimedia.org/r/508590

Today's incident at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190507-opcache was very similar and presumably the same root cause.

mw1320: ConfigException from line 53 of /srv/mediawiki/php-1.34.0-wmf.3/includes/config/GlobalVarConfig.php: GlobalVarConfig::get: undefined option: 'UseKeyHe`der'

The string literal UseKeyHeader had somehow been corrupted in memory to become UseKeyHe`der.

Keeping task open for now, as I'm unsure what the wider context is and what other changes have been or will be made.

If everything else is still the same, then performing opcache invalidation at regular intervals is presumably not enough to fix avoid the issue. It would merely mean that the sites are "only" corrupt and potentially compromised security/privacy-wise for less than 60 seconds at a time.

But I also see T222705 and something about partial vs full opcache invalidation, leaving open for Joe to elaborate :)

Krinkle renamed this task from Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") to PHP7 opcache sometimes corrupts (was: Fatal ConfigException, undefined InitialiseSettings variable).May 8 2019, 2:00 PM
Krinkle added subscribers: DannyS712, jijiki, Nikerabbit and 5 others.
Joe renamed this task from PHP7 opcache sometimes corrupts (was: Fatal ConfigException, undefined InitialiseSettings variable) to PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable).May 8 2019, 2:02 PM

Today's incident at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190507-opcache was very similar and presumably the same root cause.

mw1320: ConfigException from line 53 of /srv/mediawiki/php-1.34.0-wmf.3/includes/config/GlobalVarConfig.php: GlobalVarConfig::get: undefined option: 'UseKeyHe`der'

The string literal UseKeyHeader had somehow been corrupted in memory to become UseKeyHe`der.

Keeping task open for now, as I'm unsure what the wider context is and what other changes have been or will be made.

If everything else is still the same, then performing opcache invalidation at regular intervals is presumably not enough to fix avoid the issue. It would merely mean that the sites are "only" corrupt and potentially compromised security/privacy-wise for less than 60 seconds at a time.

It's the best compromise at the moment, short of having to rewrite our deployment system from scratch. Finding what's wrong in php-fpm seems even less likely to succeed. So this is just part of a strategy to mitigate the problem

But I also see T222705 and something about partial vs full opcache invalidation, leaving open for Joe to elaborate :)

Pybal detects a server spitting errors way faster than 60 seconds, so with T222705 in place the window in which the corrupted code is running to the smallest possible amount of time.

Also, we've noticed such corruptions happen only (at least in my observations) on some partial invalidation. Once we move the database configuration to etcd (hopefully this quarter) we should be able to fully invalidate opcache on any release and not just for large-ish ones.

This should all reduce the number of potential errors well below the threshold we consider acceptable.

Pybal detects a server spitting errors way faster than 60 seconds, […] in which the corrupted code is running to the smallest possible amount of time.

This assumes the corruption would cause a fatal exception on a popular request, which I think is too optimistic. If the corruption causes a critical class or file to appear missing, that's imho a lucky and relatively positive outcome.

I don't know how the probabilities distribute here, but it seems to me just as likely (possibly more so) that the corruption will change something in the 90% of code that isn't going to cause a fatal error if one character changes. Do we have reason to believe this won't cause user names to be misattributed? Or the settings key for "private wikis are private" to become effectively undefined (by being renamed), at which point the MediaWiki default could be inherited instead which "gracefully" falls back to all wikis being open and accepting public account sign ups.

Even one such account creation succeeding would be pretty hard to detect and recover from, because that would remain valid even after opcache recovers. It seems to me like it's just plain broken and we need to abort until we figure this out.

such corruptions happen only (at least in my observations) on some partial invalidation.

What is the cost of a full invalidation in terms of impact on web requests?

During a deployment today (sync-file wmf-config/CommonSettings), at least two app servers had their cache corrupted.

  • mw1313 PHP Notice: Undefined variable: rlModulesArr /srv/mediawiki/php-1.34.0-wmf.4/extensions/Wikibase/lib/includes/Formatters/CachingKartographerEmbeddingHandler.php:155

this variable went missing somehow.

  • mw1239 ConfigException: undefined option: 'WhitelistRead' /srv/mediawiki/php-1.34.0-wmf.4/includes/ServiceWiring.php:452

This array key went missing somehow.

Fixed by clearing opcache manually (run scap pull on the two servers locally).

Change 509433 had a related patch set uploaded (by Krinkle; owner: CDanis):
[mediawiki/tools/scap@master] scap: don't do partial invalidations of the opcache

https://gerrit.wikimedia.org/r/509433

Change 509438 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] php-admin: always full-reset opcache, ignore filename list

https://gerrit.wikimedia.org/r/509438

Change 509438 merged by CDanis:
[operations/puppet@production] php-admin: always full-reset opcache, ignore filename list

https://gerrit.wikimedia.org/r/509438

Change 509433 abandoned by CDanis:
scap: don't do partial invalidations of the opcache

Reason:
Ie1b1219a

https://gerrit.wikimedia.org/r/509433

Joe claimed this task.

FWIW, we had no more occurrences of this problem since we disabled partial opcache flushes. I am thus resolving the task.

Could be a coincidence, but I did see similar issues again today. The server did repair itself within a minute or two, but it's still pretty bad that every deployment can send user requests to random fatal errors (or worse, make the software behave in random and unpredictable ways). Continued at T224491.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM