Avoid php-opcache corruption in WMF production
Open, High · Public

Description

NOTE: Please send reports of corruptions to T245183, not here.
Background

The main tracking task for the php-opcache corruption we saw in production was T224491. This was closed in September after various workarounds were put in place that reduced the chances of corruption happening.

In a nutshell:

  • For PHP to perform acceptably, we need a compilation cache. This is enabled by default and is called opcache. Similar systems existed in HHVM and PHP5 as well (it is not new). This system translates the .php files from disk into a parsed and optimised machine-readable format, stored in RAM.
  • Our current deployment model is based on changing files directly on disk (Scap, and rsync), and does not involve containers, new servers, or servers being depooled/re-pooled. Instead, they remain live serving.
  • Updates to files are picked up by PHP by comparing the mtime of files on disk, if more than a configurable number of seconds have passed since the last time it checked.
  • When it finds such an update, it recompiles the file on-the-fly and adds it to memory. It does not remove or replace the existing entry, as another on-going request might still be using that.
  • In addition to not removing or replacing on-demand, there is also no background process or other garbage collection concept in place. Instead, it grows indefinitely until it runs out of memory, at which point it is forced to reset the opcache and start from scratch. When this happens, php7-opcache shits itself and causes unrecoverable corruption to the interpreted source code. (Unrecoverable meaning it is not temporary or self-correcting; any corruption that occurs tends to be sticky until a human restarts the server.)
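
The behaviour described in the bullets above can be illustrated with a toy model. This is only a sketch and not how opcache is implemented: the class name ToyOpcache, its fields, and the sizes used are made up for illustration. Only the behaviour (a timed mtime check, stale copies left in place, and a full reset once memory runs out) is taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOpcache:
    """Toy model of the described behaviour; NOT the real opcache logic."""
    capacity: int            # total cache memory (arbitrary units)
    revalidate_freq: float   # seconds between mtime checks per file
    used: int = 0            # memory allocated, including stale copies
    wasted: int = 0          # memory held by stale copies
    restarts: int = 0        # forced resets after running out of memory
    entries: dict = field(default_factory=dict)  # path -> (mtime, size, last_check)

    def load(self, path, mtime, size, now):
        entry = self.entries.get(path)
        if entry is not None:
            cached_mtime, cached_size, last_check = entry
            if now - last_check < self.revalidate_freq:
                return "hit (not rechecked)"
            if cached_mtime == mtime:
                self.entries[path] = (cached_mtime, cached_size, now)
                return "hit (revalidated)"
            # File changed: the old copy is NOT freed, it only becomes waste.
            self.wasted += cached_size
        if self.used + size > self.capacity:
            # Out of memory: the only recovery is to reset everything.
            self.entries.clear()
            self.used = 0
            self.wasted = 0
            self.restarts += 1
        self.entries[path] = (mtime, size, now)
        self.used += size
        return "compiled"
```

Repeatedly "deploying" a new mtime for the same file in this model makes `used` and `wasted` climb until a reset is forced, which is the moment the real system is at risk.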

What we did to close T224491:

  • A cronjob is active on all MW app servers that checks every few minutes whether opcache is close to running out of memory. If it is, we try to prevent it from corrupting itself by automatically depooling the server, restarting it cleanly with no live traffic (and thus presumably no way to trigger the race condition that causes the corruption), and then repooling it. This cronjob has Icinga alerting on it, and it is spread out so that we don't restart "too many" servers at once.
  • The Scap deployment tool also invokes the same script as the cronjob to perform this restart around deployments, so that if we know we're close to running out of memory we don't wait for traffic to fill memory with the new source code, but rather catch it proactively.
Status quo

We still see corruptions from time to time. New ones are now tracked at T245183.

We are kind of stuck because any kind of major deployment or other significant temporary or indefinite utilisation of opcache (e.g. T99740) should involve a php-fpm restart to be safe, but we can't easily do a rolling restart because:

  • Live traffic has to go somewhere, so we can't restart all at once.
  • If we don't restart all at once, that means we have to do a slow rolling one.
  • Which means deployments take 15 minutes or longer. This would be a huge increase compared to the 1-2 minutes it takes today.
Ideas
  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.
    • Method: Memory is controlled by not building up stale copies of old code.
    • Benefit: Easy to add.
    • Downside: We keep all the cronjob and scap complexity we have today.
  2. Spawn fresh php-fpm instances for each deploy.
    • Method: Some kind of socket transfer for live traffic. Automatic opcache updates would be disabled, thus memory can't grow indefinitely.
    • Benefit: Fast deployment. Relatively clean and easy to reason about.
    • Benefit: We get to remove the complexity we have today.
    • Benefit: We get to prepare and re-use what we learn here for the direction of "MW on containers".
    • Downside: Non-trivial to build.
    • Downside: More disruption to apcu lifetime. We may need to do one of T244340 or T248005 first in that case.
  3. Get the php-opcache bug(s) fixed upstream.
    • Method: Contractor?
    • Benefit: Fast deployment. Relatively clean and easy to reason about.
    • Benefit: We get to remove the complexity we have today.
    • Downside: Unsure if php-opcache is beyond fixing.

Event Timeline

Krinkle created this task. · May 26 2020, 6:21 PM
Restricted Application added a subscriber: Aklapper. · May 26 2020, 6:21 PM
  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

I believe we currently do this in batches of N servers at once, where the next batch starts after the previous is fully finished. Could this be optimised by letting the batches overlap? E.g. rather than chunks of N, we'd have at most N undergoing a restart at once. More "rolling". I heard some ideas on IRC also involving PoolCounter, but a local variable on the deployment server could perhaps work as well.

We currently don't have coordination between Scap and the cronjobs (which could interfere), so PoolCounter could be used to ensure that coordination. E.g. we'd have at most N servers in a DC undergoing restart, the server would take care of it locally, and Scap just invokes the script on all servers and each one waits as needed until it's done. Another way could be to communicate with PyBal instead and base it on ensuring a minimum number of pooled servers (as opposed to ensuring a max of depooled servers). I suppose there can be race conditions there though, so maybe PoolCounter is the better way.
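
To make the "at most N undergoing a restart at once" idea concrete, here is a minimal sketch. It assumes a single coordinator process (the "local variable on the deployment server" variant); it does not use PoolCounter or PyBal and does not coordinate with the cronjobs. The names rolling_restart and ConcurrencyProbe, and the host names in the usage note, are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def rolling_restart(servers, restart_one, max_in_flight):
    """Restart every server, keeping at most max_in_flight restarts running."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() hands the next server to a worker as soon as any worker is
        # free, so one slow host does not hold up the rest of a fixed batch.
        return list(pool.map(restart_one, servers))


class ConcurrencyProbe:
    """Stand-in for a real restart; records the peak number in flight."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __call__(self, server):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        time.sleep(0.01)  # placeholder for depool, restart php-fpm, repool
        with self._lock:
            self.current -= 1
        return server
```

For example, `rolling_restart([f"mw{i}" for i in range(20)], ConcurrencyProbe(), 3)` processes all 20 hypothetical hosts while never having more than 3 in flight. A real version would need the shared limit to live somewhere the cronjobs can also see.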

  1. Spawn fresh php-fpm instances for each deploy.

What would it take to do this?

  1. Get the php-opcache bug(s) fixed upstream.

[…]
Downside: Unsure if php-opcache is beyond fixing.

TODO: Ref upstream tickets and determine if there is any hope.

  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

I would like to do this -- it seems like it would solve a long-tail of issues that we've whack-a-mole'd -- but I have reservations about it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

My major worry isn't that deploys themselves take 15 minutes -- it's that rollbacks would also take 15 minutes. Waiting 15 minutes for a bad deploy to propagate and then waiting 15 minutes for it to be fixed is problematic (cf: T244544)

The current paradigm of deployments involves rolling forward to individual groups of wikis as a way to de-risk a given deployment. We don't have another way to gain that confidence currently. Without that confidence we need a fast rollback mechanism. Maybe making rollbacks faster is easier to solve than making every deployment fast (?) -- not that I know how to do that exactly :)

I believe we currently do this in batches of N servers at once, where the next batch starts after the previous is fully finished. Could this be optimised by letting the batches overlap? E.g. rather than chunks of N, we'd have at most N undergoing a restart at once. More "rolling". I heard some ideas on IRC also involving PoolCounter, but a local variable on the deployment server could perhaps work as well.

The way this is currently implemented is all the groups of servers (jobrunner, appserver, appserver_api, testserver) are restarted in parallel using a pool of workers. Not more than 10% of a given group is restarted/depooled at the same time. @Joe could probably speak to the actual depooling piece.

ori added a subscriber: ori. (Edited) · May 27 2020, 10:26 PM

A contractor is going to ask for the same thing that upstream is asking for: a reproduction case. There's not much to go on without one.

Have you tried depooling an appserver, disabling the workarounds on that server, turning on opcache.protect_memory=1, setting low limits for opcache, and hitting the server with synthetic traffic (replay GET requests from production logs using ab)?

CDanis added a subscriber: CDanis. · May 27 2020, 11:09 PM
Krinkle updated the task description. · Jun 2 2020, 2:42 PM
Krinkle triaged this task as High priority. · Jun 22 2020, 7:10 PM
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
Krinkle moved this task from Inbox to Blocked or Needs-CR on the Performance-Team board.
jijiki added a subscriber: jijiki. · Jul 1 2020, 2:31 PM
jcrespo added a subscriber: jcrespo. · Jul 8 2020, 2:35 PM

Today at 2020-07-08T14:00:12, with no deployments happening, mw1346 started generating exceptions at a high rate:

/srv/mediawiki/php-1.35.0-wmf.39/includes/config/GlobalVarConfig.php:53 GlobalVarConfig::get: undefined option: 'MinervaOverflowInPageActhons'

#0 /srv/mediawiki/php-1.35.0-wmf.39/skins/MinervaNeue/includes/MinervaHooks.php(131): GlobalVarConfig->get(string)
#1 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(320): MinervaHooks::onMobileFrontendFeaturesRegistration(MobileFrontend\Features\FeaturesManager)
#2 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(131): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#3 /srv/mediawiki/php-1.35.0-wmf.39/extensions/MobileFrontend/includes/Features/FeaturesManager.php(43): MediaWiki\HookContainer\HookContainer->run(string, array)
#4 /srv/mediawiki/php-1.35.0-wmf.39/extensions/MobileFrontend/includes/ServiceWiring.php(55): MobileFrontend\Features\FeaturesManager->useHookToRegisterExtensionOrSkinFeatures()
#5 /srv/mediawiki/php-1.35.0-wmf.39/includes/libs/services/ServiceContainer.php(451): Wikimedia\Services\ServiceContainer->{closure}(MediaWiki\MediaWikiServices)
#6 /srv/mediawiki/php-1.35.0-wmf.39/includes/libs/services/ServiceContainer.php(419): Wikimedia\Services\ServiceContainer->createService(string)
#7 /srv/mediawiki/php-1.35.0-wmf.39/extensions/MobileFrontend/includes/MobileFrontendHooks.php(1115): Wikimedia\Services\ServiceContainer->getService(string)
#8 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(320): MobileFrontendHooks::onMakeGlobalVariablesScript(array, OutputPage)
#9 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(131): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#10 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookRunner.php(2516): MediaWiki\HookContainer\HookContainer->run(string, array)
#11 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(3386): MediaWiki\HookContainer\HookRunner->onMakeGlobalVariablesScript(array, OutputPage)
#12 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(3035): OutputPage->getJSVars()
#13 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(3056): OutputPage->getRlClient()
#14 /srv/mediawiki/php-1.35.0-wmf.39/includes/skins/SkinMustache.php(82): OutputPage->headElement(SkinApi)
#15 /srv/mediawiki/php-1.35.0-wmf.39/includes/skins/SkinMustache.php(57): SkinMustache->getTemplateData()
#16 /srv/mediawiki/php-1.35.0-wmf.39/includes/skins/SkinTemplate.php(141): SkinMustache->generateHTML()
#17 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(2616): SkinTemplate->outputPage()
#18 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiFormatBase.php(333): OutputPage->output()
#19 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiFormatRaw.php(82): ApiFormatBase->closePrinter()
#20 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(1834): ApiFormatRaw->closePrinter()
#21 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(608): ApiMain->printResult(integer)
#22 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(532): ApiMain->handleException(ConfigException)
#23 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(496): ApiMain->executeActionWithErrorHandling()
#24 /srv/mediawiki/php-1.35.0-wmf.39/api.php(89): ApiMain->execute()
#25 /srv/mediawiki/php-1.35.0-wmf.39/api.php(44): wfApiMain()
#26 /srv/mediawiki/w/api.php(3): require(string)
#27 {main}

Note the spelling of MinervaOverflowInPageActhons.

PyBal quickly depooled the host, but monitoring probably kept querying the server and generating the exceptions.

@Joe did php7adm /opcache-free on mw1346 and the exceptions cleared.

Joe added a comment. · Jul 8 2020, 2:39 PM

Interestingly, according to the opcache metadata, the file where the error was (/srv/mediawiki/php-1.35.0-wmf.39/skins/MinervaNeue/includes/MinervaHooks.php) had been in opcache for 2 weeks, which isn't great because it means none of the usual triggers caused this issue:

  • no deploy
  • no opcache invalidation by chance
  • no php restarts
CDanis added a comment. · Jul 8 2020, 6:09 PM

FTR i-->h is a single bit-flip in the LSB.

CDanis added a comment. · Jul 8 2020, 6:23 PM

FTR i-->h is a single bit-flip in the LSB.

Sorry, I was off by one; it's actually a transposition. So seems much less likely to be a random flip.

I think it's also interesting to compare this failure to T221347: there it was L -> K. It's a -1 in both cases. Unsure what this could mean...
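
For reference, the arithmetic behind the two substitutions mentioned above can be checked directly. This is just a sketch that compares ASCII code points; the helper name describe_substitution is made up, and it says nothing about what actually happened in memory.

```python
def describe_substitution(expected, observed):
    """Compare two single characters by code point and by differing bits."""
    a, b = ord(expected), ord(observed)
    xor = a ^ b
    return {
        "code_point_delta": b - a,           # e.g. -1 means "one lower"
        "xor": xor,                          # which bits differ
        "bits_changed": bin(xor).count("1"),
    }


# The two cases from this task and from T221347.
for expected, observed in [("i", "h"), ("L", "K")]:
    print(expected, "->", observed, describe_substitution(expected, observed))
```

Both pairs have a code point delta of -1. In ASCII, i (0x69) and h (0x68) differ in one bit, while L (0x4C) and K (0x4B) differ in three bits.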

jijiki added a comment. (Edited) · Jul 16 2020, 8:51 PM

To move this forward a little, we can try to reproduce the issue and have a go at @ori's suggestion. If we don't get anywhere, we will revisit the pros and cons of restarting after every deploy (option 1 in the description), what we can do to optimise it, and possibly proceed with it. I hope to make some time for this in August.

Krinkle updated the task description. · Jul 16 2020, 9:20 PM
ori added a comment. (Edited) · Aug 15 2020, 5:16 AM

Chatted a bit about this with @jijiki. If the data corruption was triggered by code deployments, maybe the way to reproduce this bug is to simulate the effect of many code deployments by spamming opcache with auto-generated, randomized PHP code.

The generated code would consist of a class with a $data property, a $hash property, and a method that verifies that md5($this->data) == $this->hash. The name of the class and the method and the literal values of $data and $hash would be randomly generated by the codegen script. (Randomizing both the value of the data attribute and the class and method names should help ensure we stress both the code cache and the interned strings buffer.)

The test harness would generate the code, copy the generated PHP code to the server's document root, curl it multiple times in parallel, and repeat. It should be possible to run many iterations of this test on a depooled app server very quickly, with no risk of corrupting production data or triggering MediaWiki bugs.

One reason this might not work is if the bug is not truly internal to opcache -- e.g., if there's some specific PHP extension that does something opcache doesn't expect, etc. If that's the case, a different approach would be needed.

Thoughts?

ori added a comment. · Aug 15 2020, 5:52 PM

Script to generate randomized, self-validating code:
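
The script itself is not included in this text. As a rough illustration of the idea described in the previous comment, a generator could look something like the sketch below. It is not ori's script: the function names, identifier lengths, and the exact shape of the emitted PHP are assumptions. It emits a class with random class and method names, a $data property, and a $hash property holding md5 of the data.

```python
import hashlib
import random
import string

ALNUM = string.ascii_letters + string.digits


def random_identifier(rng, length=12):
    """Random PHP-safe identifier starting with an uppercase letter."""
    first = rng.choice(string.ascii_uppercase)
    return first + "".join(rng.choice(ALNUM) for _ in range(length - 1))


def generate_php(rng):
    """Return (class_name, php_source) for one self-validating test file."""
    class_name = random_identifier(rng)
    method_name = "check" + random_identifier(rng)
    data = "".join(rng.choice(ALNUM) for _ in range(64))
    digest = hashlib.md5(data.encode("ascii")).hexdigest()
    # Strict comparison (===) avoids PHP's loose numeric-string comparison.
    source = f"""<?php
class {class_name} {{
    public $data = '{data}';
    public $hash = '{digest}';
    public function {method_name}() {{
        return md5($this->data) === $this->hash;
    }}
}}
$obj = new {class_name}();
echo $obj->{method_name}() ? "OK" : "CORRUPT";
"""
    return class_name, source
```

A harness would write each generated file under the document root of a depooled server, request it several times in parallel, and flag any response other than OK.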

ori added a comment. · Aug 15 2020, 6:10 PM

And on the subject of useful php.ini debug settings:

opcache.protect_memory

If the bug is caused by a PHP extension mutating data that opcache expects to be immutable, opcache.protect_memory=1 should help by causing a crash with a stack trace at the point of mutation. PHP bug #73933 is an example of a bug like that.

Turning on opcache.protect_memory for the stress test I proposed above won't be useful, because the randomized code for the stress test doesn't exercise any PHP extensions. But the setting could be useful for the tests that replay requests with the full MediaWiki codebase.

opcache.consistency_checks

Came across this one today:

opcache.consistency_checks integer
If non-zero, OPcache will verify the cache checksum every N requests, where N is the value of this configuration directive. This should only be enabled when debugging, as it will impair performance.

Here's the code that actually performs the check:
https://github.com/php/php-src/blob/517c9938af/ext/opcache/ZendAccelerator.c#L2119-L2142

I'm not totally sure what it does, but it looks like it includes an Adler32 checksum with each compiled script cache entry, and verifies it on load. If there's a checksum mismatch it logs an INFO-level message and restarts opcache.

I wonder if the impact on performance would really be so bad if this is turned on in production with a value of, say, 1000.
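
To illustrate the general idea of a stored checksum that is verified on load, here is a small sketch using Adler-32 from Python's zlib module. It is not a reimplementation of what ZendAccelerator.c does; the dict-based cache and the function names store and load_checked are hypothetical.

```python
import zlib


def store(cache, key, payload):
    """Store payload bytes together with their Adler-32 checksum."""
    cache[key] = (payload, zlib.adler32(payload))


def load_checked(cache, key):
    """Return the payload, or raise if it no longer matches its checksum."""
    payload, checksum = cache[key]
    if zlib.adler32(payload) != checksum:
        raise ValueError(f"checksum mismatch for {key}")
    return payload


cache = {}
store(cache, "MinervaHooks.php", b"MinervaOverflowInPageActions")

# Simulate a corruption like the one reported above: the bytes change,
# but the stored checksum still describes the original content.
_, original_checksum = cache["MinervaHooks.php"]
cache["MinervaHooks.php"] = (b"MinervaOverflowInPageActhons", original_checksum)

try:
    load_checked(cache, "MinervaHooks.php")
except ValueError as err:
    print("detected:", err)
```

A check like this only catches corruption of the bytes it covers, and only at the moment it runs, which is why the check interval matters.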

The test harness would generate the code, copy the generated PHP code to the server's document root, curl it multiple times in parallel, and repeat. It should be possible to run many iterations of this test on a depooled app server very quickly, with no risk of corrupting production data or triggering MediaWiki bugs.

One reason this might not work is if the bug is not truly internal to opcache -- e.g., if there's some specific PHP extension that does something opcache doesn't expect, etc. If that's the case, a different approach would be needed.

Thoughts?

I am wondering if this test will increase the wasted memory and trigger an opcache restart with an empty cache. I think we have set this to 10%. We do suspect that code deployments per se might not be the issue: one file mentioned in this thread was cached weeks before its corruption. Nevertheless, it is worth a shot and testing is cheap, so we can try it.

And on the subject of useful php.ini debug settings:

opcache.protect_memory

If the bug is caused by a PHP extension mutating data that opcache expects to be immutable, opcache.protect_memory=1 should help by causing a crash with a stack trace at the point of mutation. PHP bug #73933 is an example of a bug like that.

+1 I will try that

opcache.consistency_checks

Came across this one today:

opcache.consistency_checks integer
If non-zero, OPcache will verify the cache checksum every N requests, where N is the value of this configuration directive. This should only be enabled when debugging, as it will impair performance.

Here's the code that actually performs the check:
https://github.com/php/php-src/blob/517c9938af/ext/opcache/ZendAccelerator.c#L2119-L2142

I'm not totally sure what it does, but it looks like it includes an Adler32 checksum with each compiled script cache entry, and verifies it on load. If there's a checksum mismatch it logs an INFO-level message and restarts opcache.

I wonder if the impact on performance would really be so bad if this is turned on in production with a value of, say, 1000.

I didn't know about this, sounds promising!

Summing up, after we finish a minor maintenance we have been doing on our clusters, we can do some testing in the hope of reproducing the corruption. With @Krinkle we have a list of webrequests we can start with and run against:

  • an app server, disable the systemd timer and run an ab test from mwdebug*.
  • an api server, with opcache.protect_memory=1 and wait for a segfault.

Moreover, we can:

  • test performance of opcache.consistency_checks, see if it makes sense to have it enabled on some servers or include it in the above ones
  • run codegen and see if it adds something here or confuses us more
jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board. · Aug 17 2020, 11:45 PM

Change 622761 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki::php::restarts: Allow disabling of php-fpm restarts

https://gerrit.wikimedia.org/r/622761

Change 622762 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: disable php-fpm restarts on mwdebug

https://gerrit.wikimedia.org/r/622762

Change 622762 abandoned by Effie Mouzeli:
[operations/puppet@production] hiera: disable php-fpm restarts on mwdebug

Reason:
conflict

https://gerrit.wikimedia.org/r/622762

Change 622765 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: disable php-fpm restarts on mwdebug

https://gerrit.wikimedia.org/r/622765

Change 622761 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki::php::restarts: Allow disabling of php-fpm restarts

https://gerrit.wikimedia.org/r/622761

Change 622765 merged by Effie Mouzeli:
[operations/puppet@production] hiera: disable php-fpm restarts on mwdebug

https://gerrit.wikimedia.org/r/622765

Change 625224 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] php::admin: export additional opcache metrics

https://gerrit.wikimedia.org/r/625224

Change 625224 merged by Effie Mouzeli:
[operations/puppet@production] php::admin: export additional opcache metrics

https://gerrit.wikimedia.org/r/625224

  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

I would like to do this -- it seems like it would solve a long-tail of issues that we've whack-a-mole'd -- but I have reservations about it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

My major worry isn't that deploys themselves take 15 minutes -- it's that rollbacks would also take 15 minutes. Waiting 15 minutes for a bad deploy to propagate and then waiting 15 minutes for it to be fixed is problematic (cf: T244544)

@Joe said today that a rolling restart takes 5-10 minutes, not 15 minutes.

RE: Having an option to take a shortcut, I don't oppose it existing per se, but I don't think skipping the restarts as proposed in T244544/T243009 would be effective because solving this task requires that we disable the dangerous revalidation option in opcache. Thus not restarting equates to effectively not having deployed any code.

However, if we want a shortcut that skips the rolling approach and immediately sends an all-out restart, that's something for @Joe and team to balance and decide. E.g. in a disaster scenario where most requests are HTTP 5xx anyway, and if a lack of appserver capacity is handled gracefully higher in the traffic stack, then an option for a bigger, riskier batch could make sense.

But, I think for the short term, we should proceed without this and just take 5-10 min as our new safe default and then work on reducing it.

Is anything else blocking this? Could we try it for a few deploys and test drive?

CDanis added a comment. · Oct 1 2020, 7:31 PM

BTW, I produced a short writeup aimed at deployers and others close to production: https://wikitech.wikimedia.org/wiki/User:CDanis/Diagnosing_opcache_corruption

mmodell added a subscriber: mmodell. · Oct 5 2020, 8:26 PM
kostajh added a subscriber: kostajh. · Oct 5 2020, 8:36 PM

My idea for detection/prevention of opcache corruption is to use a memory protection key to do essentially what opcache.protect_memory=1 does, but fast enough for it to be always enabled in production.

My theory of opcache corruption is that the large number of pointers into shared memory during a request provides many opportunities for accidental writes, due to dangling pointers or other programmer errors. It's not feasible to do mprotect() on the shared memory every time shared memory is written to, because mprotect() needs to write to every page table entry which makes it O(N) in the size of the segment. Every request writes to shared memory, because shared memory contains locks which are incremented and decremented during read operations.

My idea is to tag shared memory with a pkey. Then when entering or exiting a section of the code that writes to shared memory, only a single instruction (WRPKRU) needs to be executed to change the permissions on all of shared memory.

The goal is to convert shared memory corruption into segfaults, which are less damaging in production. Segfaults can produce core dumps, potentially giving a lead as to the root cause of the memory corruption.

jijiki added a comment. (Edited) · Oct 21 2020, 4:43 PM

Yesterday we had opcache corruptions on 2 servers, mw2328 and mw2252. I don't know about other times, but for those 2 specific corruptions, I can say that they happened right after opcache restarted because it had reached its max cached keys on these servers:

mw2328:
    "start_time": 1600177590, -> Tuesday, 15 September 2020 
    "last_restart_time": 1603211850, -> Tuesday, 20 October 2020 16:37:30
    "oom_restarts": 0,
    "hash_restarts": 2,

mw2252:
    "start_time": 1600174055, -> Tuesday, 15 September 2020 12:47:35
    "last_restart_time": 1603217533, -> Tuesday, 20 October 2020 18:12:13
    "oom_restarts": 0,
    "hash_restarts": 2,

Yesterday some more servers had their opcache restarted for the same reason. Looking for servers where opcache_statistics.hash_restarts is 2:

(48) mw[2218-2220,2222-2223,2252-2253,2262,2283,2285-2289,2291-2300,2304,2306,2308,2317,2320-2324,2328,2332,2334,2350,2352,2358,2360,2362,2364,2366-2368,2370,2372,2374].codfw.wmnet
----- OUTPUT of 'php7adm  /opcach...cs.hash_restarts' -----
2

Looking at when those servers had their opcache restarted, it was yesterday, and most of them (if not all, I have not checked yet) are api servers:

mw2218.codfw.wmnet: 1603215616
mw2219.codfw.wmnet: 1603212256
mw2220.codfw.wmnet: 1603215818
mw2222.codfw.wmnet: 1603222186
mw2223.codfw.wmnet: 1603218900
mw2252.codfw.wmnet: 1603217533
mw2253.codfw.wmnet: 1603212502
mw2262.codfw.wmnet: 1603224353
mw2283.codfw.wmnet: 1603219582
mw2285.codfw.wmnet: 1603215620
mw2286.codfw.wmnet: 1603222040
mw2287.codfw.wmnet: 1603211649
mw2288.codfw.wmnet: 1603215617
mw2289.codfw.wmnet: 1603217488
mw2291.codfw.wmnet: 1603213619
mw2292.codfw.wmnet: 1603211083
mw2293.codfw.wmnet: 1603222034
mw2294.codfw.wmnet: 1603220461
mw2295.codfw.wmnet: 1603215815
mw2296.codfw.wmnet: 1603213118
mw2297.codfw.wmnet: 1603221654
mw2298.codfw.wmnet: 1603219884
mw2299.codfw.wmnet: 1603221725
mw2300.codfw.wmnet: 1603215618
mw2304.codfw.wmnet: 1603221327
mw2306.codfw.wmnet: 1603221212
mw2308.codfw.wmnet: 1603214999
mw2317.codfw.wmnet: 1603216066
mw2320.codfw.wmnet: 1603214233
mw2321.codfw.wmnet: 1603216670
mw2322.codfw.wmnet: 1603220571
mw2323.codfw.wmnet: 1603220597
mw2324.codfw.wmnet: 1603221161
mw2328.codfw.wmnet: 1603211850
mw2332.codfw.wmnet: 1603221654
mw2334.codfw.wmnet: 1603223000
mw2350.codfw.wmnet: 1603221653
mw2352.codfw.wmnet: 1603224062
mw2358.codfw.wmnet: 1603217075
mw2360.codfw.wmnet: 1603220878
mw2362.codfw.wmnet: 1603217662
mw2364.codfw.wmnet: 1603215611
mw2366.codfw.wmnet: 1603213081
mw2367.codfw.wmnet: 1603200046
mw2368.codfw.wmnet: 1603220098
mw2370.codfw.wmnet: 1603221962
mw2372.codfw.wmnet: 1603216361
mw2374.codfw.wmnet: 1603213716
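
The values above are Unix epoch seconds. A small sketch for turning lines in this "host: epoch" format into UTC times, so the restarts can be ordered and compared with the corruption reports, could look like this. The function name parse_restarts is made up; the three sample lines are copied from the list above.

```python
from datetime import datetime, timezone


def parse_restarts(text):
    """Parse 'host: epoch' lines into a dict of host -> UTC datetime."""
    restarts = {}
    for line in text.strip().splitlines():
        host, _, epoch = line.partition(":")
        restarts[host.strip()] = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return restarts


sample = """
mw2252.codfw.wmnet: 1603217533
mw2328.codfw.wmnet: 1603211850
mw2367.codfw.wmnet: 1603200046
"""

# Print the sample hosts in order of their last opcache restart.
for host, when in sorted(parse_restarts(sample).items(), key=lambda kv: kv[1]):
    print(when.isoformat(), host)
```

For mw2328 this gives 2020-10-20T16:37:30+00:00, matching the last_restart_time shown earlier in this comment.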

That being said, we can enhance the cronjob script we have to check this metric as well, and trigger a php-fpm restart. It would then restart when free opcache memory is below 200 MB or when cached keys exceed 32k.
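
The two conditions can be sketched as a small decision function. This is not the actual php-check-and-restart.sh logic. It assumes the status data has the same shape as PHP's opcache_get_status() output (memory_usage.free_memory and opcache_statistics.num_cached_keys), and it reads "32k" as 32,000; both are assumptions, as is the function name restart_reasons.

```python
MIN_FREE_BYTES = 200 * 1024 * 1024  # "200 MB" threshold from the comment above
MAX_CACHED_KEYS = 32_000            # "32k"; the exact value is an assumption


def restart_reasons(status, min_free=MIN_FREE_BYTES, max_keys=MAX_CACHED_KEYS):
    """Return the reasons (possibly none) why php-fpm should be restarted."""
    free = status["memory_usage"]["free_memory"]
    keys = status["opcache_statistics"]["num_cached_keys"]
    reasons = []
    if free < min_free:
        reasons.append(f"free opcache memory {free} < {min_free}")
    if keys > max_keys:
        reasons.append(f"cached keys {keys} > {max_keys}")
    return reasons
```

An empty list means no restart is needed. Returning the reasons rather than a bare boolean makes it easier to log why a given server was depooled and restarted.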

My idea is to tag shared memory with a pkey. Then when entering or exiting a section of the code that writes to shared memory, only a single instruction (WRPKRU) needs to be executed to change the permissions on all of shared memory.

The goal is to convert shared memory corruption into segfaults, which are less damaging in production. Segfaults can produce core dumps, potentially giving a lead as to the root cause of the memory corruption.

How difficult would that be to implement? It sounds relatively straightforward but I'm not very familiar with PHP internals. A segfault would definitely be a huge improvement over the random behavior we've been seeing.

Change 635854 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki: Check number of cached keys in php-check-and-restart.sh

https://gerrit.wikimedia.org/r/635854

Change 635854 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki: Check number of cached keys in php-check-and-restart.sh

https://gerrit.wikimedia.org/r/635854

Change 636047 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki::php bump opcache.max_accelerated_files

https://gerrit.wikimedia.org/r/636047