PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable)
Closed, ResolvedPublicPRODUCTION ERROR
Actions

Assigned To

Authored By

	Xiplus
	Apr 18 2019, 10:19 AM

Description

Problem

From T221347#5122170 and T221347#5122194:

On 2019-04-18 from 10:00 UTC (possibly earlier), the code in mediawiki/Skin.php on the mw1274 server appeared to have changed in opcache. This lasted until 13:00 when the opcache was manually cleared.

Effective difference

  function getLogo() {
-     return $this->getConfig()->get( 'Logo' );
+     return $this->getConfig()->get( 'Kogo' );
  }

Original task description by @A2093064:

MediaWiki internal error.

Original exception: [XLhOBwpAAEUAAG7qWroAAABT] 2019-04-18 10:14:32: Fatal exception of type "ConfigException"

Exception caught inside exception handler.

Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information.

Details

Subject	Repo	Branch	Lines +/-
scap: don't do partial invalidations of the opcache	mediawiki/tools/scap	master	+2 -2
php-admin: always full-reset opcache, ignore filename list	operations/puppet	production	+6 -1
php-fpm: revalidate opcache every 60 seconds.	operations/puppet	production	+2 -1

Customize query in gerrit

Related Objects
Search...

Status	Subtype	Assigned	Task
Resolved		jijiki	T219127 SRE FY19-20 Q1 goal: complete the transition to PHP7
Resolved		None	T176370 Migrate to PHP 7 in WMF production
Resolved	PRODUCTION ERROR	Joe	T221347 PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable)

Event Timeline

Xiplus created this task.Apr 18 2019, 10:19 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 18 2019, 10:19 AM

I have also encountered this for times on zh.wikipedia ( When viewing page history and diff )

Also met this problem in 18:10 (GMT+8)

message

ConfigException from line 53 of /srv/mediawiki/php-1.33.0-wmf.25/includes/config/GlobalVarConfig.php: GlobalVarConfig::get: undefined option: 'Kogo'

trace

#0 /srv/mediawiki/php-1.33.0-wmf.25/includes/skins/Skin.php(498): GlobalVarConfig->get(string)
#1 /srv/mediawiki/php-1.33.0-wmf.25/includes/skins/Skin.php(956): Skin->getLogo()
#2 /srv/mediawiki/php-1.33.0-wmf.25/includes/skins/SkinTemplate.php(383): Skin->logoText()
#3 /srv/mediawiki/php-1.33.0-wmf.25/includes/skins/SkinTemplate.php(228): SkinTemplate->prepareQuickTemplate()
#4 /srv/mediawiki/php-1.33.0-wmf.25/includes/OutputPage.php(2723): SkinTemplate->outputPage()
#5 /srv/mediawiki/php-1.33.0-wmf.25/includes/MediaWiki.php(872): OutputPage->output(boolean)
#6 /srv/mediawiki/php-1.33.0-wmf.25/includes/MediaWiki.php(884): MediaWiki->{closure}()
#7 /srv/mediawiki/php-1.33.0-wmf.25/includes/MediaWiki.php(515): MediaWiki->main()
#8 /srv/mediawiki/php-1.33.0-wmf.25/index.php(42): MediaWiki->run()
#9 /srv/mediawiki/w/index.php(3): require(string)
#10 {main}

Seen 928 times in the last 24 hours. I've also just faced it after opening my watchlist.

Although I found it on itwiki where we're always at 1.33-wmf.25...

Well, looking at the message and at this line of code, it seems like it used to read"Kogo" instead of "Logo", although there's no sign of this typo in wmf.25, nor in master.

4shadoww subscribed.Apr 18 2019, 11:36 AM

Daimona merged a task: T221358: Fatal exception of type ConfigException on enwiki.Apr 18 2019, 12:00 PM

Daimona added a subscriber: revi.

Ammarpad subscribed.Apr 18 2019, 12:03 PM

This error appeared to me on ptwiki when I tried to delete some edit revisions

This is happening randomly when using the enwiki.

Urbanecm subscribed.Apr 18 2019, 12:20 PM

+1 comments don't help, just one report is enough :). cc @Gangster8192 and @MrFulano .

Indeed, I echo Urbanecm. Giving high priority because a refresh fixes it and the code already looks correct. If you think it should be UBN, feel free to escalate it.

Mentioned in SAL (#wikimedia-operations) [2019-04-18T12:36:23Z] <anomie> Ran php7adm /opcache-free on mw1274 to test a theory related to T221347. The log entries related to that task stopped immediately.

This error is also in Czech Wikipedia.

Looking into this, I found that all instances of the "Kogo" exception were on mw1274 and all were using php 7.2. Having seen https://wikitech.wikimedia.org/wiki/Debugging_in_production#PHP7_Opcache recently, I tried running php7adm /opcache-free on mw1274 and the log entries about this occurring immediately stopped.

I have no idea how "Kogo" might have gotten into the opcache in the first place. Adding a few relevant tags for further investigation into that.

(apologies for not submitting this comment earlier, external factors prevented it)

Hmm, L and K having a hamming distance of 3 - Could this possibly be a memory error that wasn't detectable by ECC as a 3 bit error?

Given that they're close on the keyboard, I originally thought of a typo. But that would mean someone edited the file and introduced the typo on mw1274. Could be something more subtle of course.

Krinkle moved this task from Apr 2019 / 1.33.wmf.25+ to Mar 2019 / 1.33wmf20-23 on the Wikimedia-production-error board.Apr 18 2019, 7:05 PM

Amorymeltzer subscribed.Apr 18 2019, 11:50 PM

Krinkle renamed this task from Fatal exception of type "ConfigException" to Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException").Apr 19 2019, 7:15 PM

Krinkle updated the task description. (Show Details)

Shouldn't prio be lowered? If I'm not mistaken, we're in post mortem stage - which can last longer than fixing the error :-).

Yes, that's correct. However I'd encourage to investigate this issue now: you know, logs get garbage collected, people may forget what they did... Generally speaking, it could become hard or impossible to find tracks of this change in a few months. That being said, I checked again, and the first occurrence of "Kogo" on mw1274 is on 18/04 at 9:41 UTC. Nothing suspicious in the SAL at that time. Maybe it could be due to some unlogged change? The ConfigExceptions started at that time with a constant rate of ~10 per minute, so I'd expect the offending change to have happened no more than a couple of minutes before 9:41. But again, the only thing in the SAL around that time is a sync of InitialiseSettings which cannot have caused this issue.

I realized I failed to update this ticket with my investigation:

looking at the metrics we collect we exhausted the opcache string memory quite consistently. I've uploaded and merged a patch this morning that (among other things) bumps that cache to 50 MB from 8 MB.

While this hasn't been applied everywhere for now, it looks like a viable suspect for such a weird behaviour.

Btw, logs don't tell you much in the case you meet such a bug - I would've had to go scavenge a core dump of the running process to try to figure out what was actually going on there. I think the priority should be lowered and we should look at reoccurrences of the same bug after today's change has been widely applied.

In T221347#5126757, @Joe wrote:

Btw, logs don't tell you much in the case you meet such a bug - I would've had to go scavenge a core dump of the running process to try to figure out what was actually going on there. I think the priority should be lowered and we should look at reoccurrences of the same bug after today's change has been widely applied.

Does that mean we are aware of _other_ occurances of similar bug? Unless I'm blind, there's no comment mentioning that above.

BethNaught subscribed.Apr 20 2019, 9:30 PM

Krinkle mentioned this in Blog Post: Production Excellence #9: March 2019.Apr 21 2019, 6:51 PM

In T221347#5126824, @Urbanecm wrote:

In T221347#5126757, @Joe wrote:

Btw, logs don't tell you much in the case you meet such a bug - I would've had to go scavenge a core dump of the running process to try to figure out what was actually going on there. I think the priority should be lowered and we should look at reoccurrences of the same bug after today's change has been widely applied.

Does that mean we are aware of _other_ occurances of similar bug? Unless I'm blind, there's no comment mentioning that above.

No, I just said that if this doesn't happen again, we can have some confidence my change did the trick.

Krinkle moved this task from Untriaged to Wikimedia production on the PHP 7.2 support board.Apr 25 2019, 9:33 PM

Change 508590 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] php-fpm: revalidate opcache every 60 seconds.

https://gerrit.wikimedia.org/r/508590

gerritbot added a project: Patch-For-Review.May 7 2019, 2:36 PM

Change 508590 merged by Giuseppe Lavagetto:
[operations/puppet@production] php-fpm: revalidate opcache every 60 seconds.

https://gerrit.wikimedia.org/r/508590

Krinkle mentioned this in T218692: read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error".May 7 2019, 3:02 PM

Today's incident at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190507-opcache was very similar and presumably the same root cause.

mw1320: ConfigException from line 53 of /srv/mediawiki/php-1.34.0-wmf.3/includes/config/GlobalVarConfig.php: GlobalVarConfig::get: undefined option: 'UseKeyHe`der'

The string literal UseKeyHeader had somehow been corrupted in memory to become UseKeyHe`der.

Keeping task open for now, as I'm unsure what the wider context is and what other changes have been or will be made.

If everything else is still the same, then performing opcache invalidation at regular intervals is presumably not enough to fix avoid the issue. It would merely mean that the sites are "only" corrupt and potentially compromised security/privacy-wise for less than 60 seconds at a time.

But I also see T222705 and something about partial vs full opcache invalidation, leaving open for Joe to elaborate :)

Krinkle renamed this task from Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") to PHP7 opcache sometimes corrupts (was: Fatal ConfigException, undefined InitialiseSettings variable).May 8 2019, 2:00 PM

Krinkle added a parent task: T219127: SRE FY19-20 Q1 goal: complete the transition to PHP7.

Krinkle merged a task: T218005: Variable from InitialiseSettings can be undefined (corrupt opcache?) .

Krinkle added subscribers: DannyS712, jijiki, Nikerabbit and 5 others.

Joe renamed this task from PHP7 opcache sometimes corrupts (was: Fatal ConfigException, undefined InitialiseSettings variable) to PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable).May 8 2019, 2:02 PM

In T221347#5165706, @Krinkle wrote:

Today's incident at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190507-opcache was very similar and presumably the same root cause.

mw1320: ConfigException from line 53 of /srv/mediawiki/php-1.34.0-wmf.3/includes/config/GlobalVarConfig.php: GlobalVarConfig::get: undefined option: 'UseKeyHe`der'

The string literal UseKeyHeader had somehow been corrupted in memory to become UseKeyHe`der.

Keeping task open for now, as I'm unsure what the wider context is and what other changes have been or will be made.

If everything else is still the same, then performing opcache invalidation at regular intervals is presumably not enough to fix avoid the issue. It would merely mean that the sites are "only" corrupt and potentially compromised security/privacy-wise for less than 60 seconds at a time.

It's the best compromise at the moment, short of having to rewrite our deployment system from scratch. Finding what's wrong in php-fpm seems even less likely to succeed. So this is just part of a strategy to mitigate the problem

But I also see T222705 and something about partial vs full opcache invalidation, leaving open for Joe to elaborate :)

Pybal detects a server spitting errors way faster than 60 seconds, so with T222705 in place the window in which the corrupted code is running to the smallest possible amount of time.

Also, we've noticed such corruptions happen only (at least in my observations) on some partial invalidation. Once we move the database configuration to etcd (hopefully this quarter) we should be able to fully invalidate opcache on any release and not just for large-ish ones.

This should all reduce the number of potential errors well below the threshold we consider acceptable.

Pybal detects a server spitting errors way faster than 60 seconds, […] in which the corrupted code is running to the smallest possible amount of time.

This assumes the corruption would cause a fatal exception on a popular request, which I think is too optimistic. If the corruption causes a critical class or file to appear missing, that's imho a lucky and relatively positive outcome.

I don't know how the probabilities distribute here, but it seems to me just as likely (possibly more so) that the corruption will change something in the 90% of code that isn't going to cause a fatal error if one character changes. Do we have reason to believe this won't cause user names to be misattributed? Or the settings key for "private wikis are private" to become effectively undefined (by being renamed), at which point the MediaWiki default could be inherited instead which "gracefully" falls back to all wikis being open and accepting public account sign ups.

Even one such account creation succeeding would be pretty hard to detect and recover from, because that would remain valid even after opcache recovers. It seems to me like it's just plain broken and we need to abort until we figure this out.

such corruptions happen only (at least in my observations) on some partial invalidation.

What is the cost of a full invalidation in terms of impact on web requests?

DannyS712 mentioned this in T218005: Variable from InitialiseSettings can be undefined (corrupt opcache?) .May 9 2019, 6:12 AM

During a deployment today (sync-file wmf-config/CommonSettings), at least two app servers had their cache corrupted.

mw1313 PHP Notice: Undefined variable: rlModulesArr /srv/mediawiki/php-1.34.0-wmf.4/extensions/Wikibase/lib/includes/Formatters/CachingKartographerEmbeddingHandler.php:155

this variable went missing somehow.

mw1239 ConfigException: undefined option: 'WhitelistRead' /srv/mediawiki/php-1.34.0-wmf.4/includes/ServiceWiring.php:452

This array key went missing somehow.

Fixed by clearing opcache manually (run scap pull on the two servers locally).

Change 509433 had a related patch set uploaded (by Krinkle; owner: CDanis):
[mediawiki/tools/scap@master] scap: don't do partial invalidations of the opcache

https://gerrit.wikimedia.org/r/509433

Krinkle added a parent task: T176370: Migrate to PHP 7 in WMF production.May 10 2019, 3:28 PM

Change 509438 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] php-admin: always full-reset opcache, ignore filename list

https://gerrit.wikimedia.org/r/509438

Change 509438 merged by CDanis:
[operations/puppet@production] php-admin: always full-reset opcache, ignore filename list

https://gerrit.wikimedia.org/r/509438

Change 509433 abandoned by CDanis:
scap: don't do partial invalidations of the opcache

Reason:
Ie1b1219a

https://gerrit.wikimedia.org/r/509433

ArielGlenn removed a project: Patch-For-Review.May 15 2019, 10:55 AM

FWIW, we had no more occurrences of this problem since we disabled partial opcache flushes. I am thus resolving the task.

Patriccck awarded a token.May 23 2019, 1:05 PM

Krinkle mentioned this in T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262).May 28 2019, 4:12 PM

Could be a coincidence, but I did see similar issues again today. The server did repair itself within a minute or two, but it's still pretty bad that every deployment can send user requests to random fatal errors (or worse, make the software behave in random and unpredictable ways). Continued at T224491.

Krinkle closed this task as a duplicate of T222452: PHP Fatal Errors on mw1275 after deployment.May 29 2019, 3:56 PM

Restricted Application removed a subscriber: Liuxinyu970226. · View Herald TranscriptMay 29 2019, 3:56 PM

Krinkle changed the task status from Duplicate to Resolved.May 29 2019, 3:57 PM

Krinkle merged a task: T222452: PHP Fatal Errors on mw1275 after deployment.

Krinkle mentioned this in T219242: deployment-mediawiki-09 PHP7 has broken cache of codebase?.

Krinkle added a subscriber: Lucas_Werkmeister_WMDE.

Krinkle mentioned this in T230037: Create warmup procedure for MediaWiki app servers.Aug 7 2019, 3:57 PM

• mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM

Daimona mentioned this in T253673: Avoid php-opcache corruption in WMF production.Jul 8 2020, 6:43 PM

Aklapper removed a subscriber: Anomie.Oct 16 2020, 5:40 PM

Tgr mentioned this in T266052: Interface 'MediaWiki\EditPafe\IEditObject' not found.Oct 21 2020, 6:49 AM

	F28693697: -42680d98730350f3.jpg
	Apr 18 2019, 10:23 AM

PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable)Closed, ResolvedPublicPRODUCTION ERRORActions