
PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Two kinds of errors appeared during a deployment earlier today.

This was limited to one server (mw1262) and one PHP runtime (PHP 7), and recovered within a minute or two. This suggests it's related to opcache corruption or something similar.

See also: T221347

Fatal error due to undefined interface

Request ID: XO1PdwpAADkAAAL-DwYAAABE

message
[{exception_id}] {exception_url}   PHP Fatal Error from line 29 of /srv/mediawiki/php-1.34.0-wmf.6/vendor/wikibase/data-model/src/Entity/Item.php: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found

It seems there is no useful stacktrace. Ping T187147.

Host: mw1262 / php7.2
Duration: 15:10–15:11 (2 minutes)
PHP error - undefined key

Request ID: XO1PcgpAADkAAAL-Do8AAABA

message
PHP Notice: Undefined index: userlangattributes

#0 /srv/mediawiki/php-1.34.0-wmf.6/includes/skins/QuickTemplate.php(117): MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.34.0-wmf.6/skins/Vector/includes/VectorTemplate.php(229): QuickTemplate->html(string)
#2 /srv/mediawiki/php-1.34.0-wmf.6/skins/Vector/includes/VectorTemplate.php(205): VectorTemplate->renderPortal(string, array)
#3 /srv/mediawiki/php-1.34.0-wmf.6/skins/Vector/includes/VectorTemplate.php(118): VectorTemplate->renderPortals(array)
#4 /srv/mediawiki/php-1.34.0-wmf.6/includes/skins/SkinTemplate.php(230): VectorTemplate->execute()
#5 /srv/mediawiki/php-1.34.0-wmf.6/includes/OutputPage.php(2749): SkinTemplate->outputPage()
#6 /srv/mediawiki/php-1.34.0-wmf.6/includes/exception/MWException.php(180): OutputPage->output()
#7 /srv/mediawiki/php-1.34.0-wmf.6/includes/exception/MWException.php(216): MWException->reportHTML()
#8 /srv/mediawiki/php-1.34.0-wmf.6/includes/exception/MWExceptionHandler.php(119): MWException->report()
#9 /srv/mediawiki/php-1.34.0-wmf.6/includes/exception/MWExceptionHandler.php(195): MWExceptionHandler::report(MWException)
#10 /srv/mediawiki/php-1.34.0-wmf.6/includes/MediaWiki.php(542): MWExceptionHandler::handleException(MWException)
#11 /srv/mediawiki/php-1.34.0-wmf.6/index.php(42): MediaWiki->run()
Host: mw1262 / php7.2
Duration: 15:10

Error from line 448 of /srv/mediawiki/php-1.34.0-wmf.6/includes/CommentStore.php: Call to undefined method CommentStore::getCommentInternal()

Count: 2,256
Host(s): mw1329 / php72
Duration: 13:59–14:09 (10 minutes)
Call to undefined method PermissionManager::checkPageRestrictions()
message
[{exception_id}] {exception_url}   Error from line 264 of /srv/mediawiki/php-1.34.0-wmf.6/includes/Permissions/PermissionManager.php: Call to undefined method MediaWiki\Permissions\PermissionManager::checkPageRestrictions()

#0 /srv/mediawiki/php-1.34.0-wmf.6/includes/Permissions/PermissionManager.php(117): MediaWiki\Permissions\PermissionManager->getPermissionErrorsInternal(string, User, Title, string, boolean)
#1 /srv/mediawiki/php-1.34.0-wmf.6/includes/Title.php(2217): MediaWiki\Permissions\PermissionManager->userCan(string, User, Title, string)
#2 /srv/mediawiki/php-1.34.0-wmf.6/includes/Title.php(2185): Title->userCan(string, User, string)
#3 /srv/mediawiki/php-1.34.0-wmf.6/includes/skins/SkinTemplate.php(987): Title->quickUserCan(string, User)
#4 /srv/mediawiki/php-1.34.0-wmf.6/includes/skins/SkinTemplate.php(464): SkinTemplate->buildContentNavigationUrls()
#5 /srv/mediawiki/php-1.34.0-wmf.6/includes/skins/SkinTemplate.php(228): SkinTemplate->prepareQuickTemplate()
#6 /srv/mediawiki/php-1.34.0-wmf.6/includes/OutputPage.php(2749): SkinTemplate->outputPage()
#7 /srv/mediawiki/php-1.34.0-wmf.6/includes/exception/MWExceptionRenderer.php(134): OutputPage->output()
#8 /srv/mediawiki/php-1.34.0-wmf.6/includes/exception/MWExceptionRenderer.php(53): MWExceptionRenderer::reportHTML(Error)
#9 /srv/mediawiki/php-1.34.0-wmf.6/includes/exception/MWExceptionHandler.php(121): MWExceptionRenderer::output(Error, integer)
#10 /srv/mediawiki/php-1.34.0-wmf.6/includes/exception/MWExceptionHandler.php(195): MWExceptionHandler::report(Error)
#11 /srv/mediawiki/php-1.34.0-wmf.6/includes/MediaWiki.php(545): MWExceptionHandler::handleException(Error)
#12 /srv/mediawiki/php-1.34.0-wmf.6/index.php(42): MediaWiki->run()
Count: 1,974
Host: mw1273 / php72
Duration: 15–20 minutes
GlobalVarConfig::get: undefined option
[{exception_id}] {exception_url} ConfigException from line 53 of /srv/mediawiki/php-1.34.0-wmf.6/includes/config/GlobalVarConfig.php: GlobalVarConfig::get: undefined option: 'UseES'
Count: 895 exceptions
Host: mw1326 / php72
Duration: ~10 minutes
Two different corruptions on the same server
MWException: Invalid callback \Wikibase\Client\Hooks\MagicWordHookHandlers::onMagicWordwgVariableID in hooks for MagicWordwgVariableIDs

#0 /srv/mediawiki/php-1.34.0-wmf.6/includes/Hooks.php
…
#9 /srv/mediawiki/php-1.34.0-wmf.6/api.php(87): ApiMain->execute()
Host: mw1227 / php72
Duration: 18:51 (1 minute)
Fatal error: Class 'Wikibase\Lib\EntityTypeDefinitions' not found

#0 /srv/mediawiki/php-1.34.0-wmf.6/extensions/Wikibase/client/includes/WikibaseClient.php(930): Wikibase\Client\WikibaseClient::newInstance()
…
#9 /srv/mediawiki/php-1.34.0-wmf.6/api.php(87): ApiMain->execute()
Host: mw1227 / php72
Duration: 18:50 (1 minute)

Related Objects

Mentioned In
T314240: Enable rolling restart for all MW servers (tracking)
T254210: ParameterAssertionException "Bad value for parameter $row->rev_timestamp" from RevisionStoreRecord.php
T245183: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.)
T253673: Avoid php-opcache corruption in WMF production
T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers
T233561: Pending changes: autoreview randomly fails
T231486: [Bug] ArgumentCountError in MobileFrontend/includes/amc/Manager.php: Too few arguments to function MobileFrontend\AMC\Manager::__construct()
T231089: WikibaseClient.php: PHP Notice: Undefined index:
T230037: Create warmup procedure for MediaWiki app servers
T229433: ResourceLoaderWikiModule.php: PHP Notice: Undefined index:
T224227: php-1.34.0-wmf.6/includes/TemplateParser.php(149) : eval()'d code: syntax error, unexpected '=>' (T_DOUBLE_ARROW), expecting ')'
T225197: PHP Warning from /w/thumb.php: "Cannot modify header information - headers already sent"
T224857: Enhance MediaWiki deployments for support of php7.x
T187147: Port mediawiki/php/wmerrors to PHP7 and deploy
T219242: deployment-mediawiki-09 PHP7 has broken cache of codebase?
T220732: 1.34.0-wmf.7 deployment blockers
T223602: Define variant Wikimedia production config in compiled, static files
T221347: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable)
T224488: PHP Fatal Error: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found
Mentioned Here
T228346: PHP 7.2 garbage collector segfault
T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters
T224857: Enhance MediaWiki deployments for support of php7.x
T225018: [Bug] TypeError in PopupsContext
T224493: Call to undefined method CommentStore::getCommentInternal
T221347: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable)
T187147: Port mediawiki/php/wmerrors to PHP7 and deploy

Event Timeline


I'm going to assume for now that T224493 is the same issue, because it too only affected PHP 7.2 and only happened on 1 server. Although it is worth noting that this issue lasted 10 minutes, instead of the usual 1-2 minutes.

Error from line 448 of /srv/mediawiki/php-1.34.0-wmf.6/includes/CommentStore.php: Call to undefined method CommentStore::getCommentInternal()

Count: 2,256
Duration: 13:59–14:09 (10 minutes)
Host(s): mw1329
See also
Random comments
  • "Sometimes a method name gets corrupted [..] (a single character being replaced with another)." – SO
  • "We do have this problem too on production, during heavy loading we got errors about sql queries that get mystery changed to "SELECT * FRPM" (notice the P), or links in our pages with www changed to vww." – php.net
  • "Seems the opcache is randomly being corrupted one character at a time." – php.net
Random suggestions

Some comments suggest that the race condition between "opcache clear" (CLI) and "opcache read/write" (from http traffic) can be avoided by not doing manual clears, but instead using time-based validation (e.g. every minute).

  • Good: The key point is that validation does not happen independently or for all files at once. Rather, it happens on demand during HTTP requests, and always involves the same kind of write as when the cache is populated normally, thus avoiding the bug.
  • Bad: Of course, time-based validation means opening us back up to race conditions mid-sync, during deployments.
  • Bad: PHP still performs clears automatically even if we don't. It happens whenever it reaches the configured memory limit for opcache, which would exhibit the same bug (ouch).
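For reference, time-based revalidation is controlled by two standard opcache INI directives; the values below are illustrative, not what production actually uses:

```ini
; Re-check file timestamps instead of relying on manual opcache_reset() calls.
opcache.validate_timestamps=1
; How often (in seconds) a file's timestamp is re-checked; 0 checks on every request.
opcache.revalidate_freq=60
```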

There is apparently an opcache.protect_memory=1 setting, which should make PHP 7 detect flipped bits / mangled strings / missing files instead of boldly continuing as if everything is fine, and fatal out with a segfault.

  • Good: This would be desirable, and if it works as advertised it would rule out all concerns regarding privacy, security and database correctness.
  • Bad: There might be a performance cost. To be determined, but unless we solve it another way, we probably don't have much choice.
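For reference, the directive in question is a one-line INI setting (a sketch; it marks opcache shared memory read-only outside of legitimate writes, so corruption crashes instead of propagating):

```ini
; Crash (SIGSEGV) on unexpected writes to opcache shared memory
; instead of silently serving corrupted code.
opcache.protect_memory=1
```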

Some suggest disabling opcache.file_cache: "if you're using Opcache file-backed caching, it's worth trying to run with it disabled" and "we're running just fine with file based caching disabled".
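If we wanted to try that, the file-backed cache is controlled by a single directive; leaving it empty (the default) disables it:

```ini
; An empty value disables the secondary file-based opcode cache.
opcache.file_cache=
```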

A patch was referenced, which landed in php-master, that might fix some or all of this issue with strings being changed after opcache clearing. A backport has been asked for, but no news on that yet. – link

(Added another two unique exceptions from two other servers, both with a duration of 10 minutes as well. Logging here for search purposes.)

This comment was removed by Krinkle.

Ok, we got different takeaways from that ticket (that I did read in the past). Let's go in order:

Opcache size: one of the main reasons this happens is the interned strings buffer filling up during the cache clear/refill. Or at least, multiple people reported that raising the interned strings buffer solved the issue. I have to add that, from what I see on the servers, we have no hash/OOM restarts:

mw1273:~$ php7adm /opcache-info | jq .opcache_statistics.oom_restarts
0

so it doesn't look like the opcache ever gets full, even in bursts. I will raise it significantly anyway and see if this helps.
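The same admin endpoint exposes the other restart counters that PHP's opcache_get_status() reports. A sketch with jq, using an illustrative JSON shape rather than real server output:

```shell
# Illustrative opcache_get_status()-shaped JSON; on a real server this would
# come from `php7adm /opcache-info` rather than a literal string.
status='{"opcache_statistics":{"oom_restarts":0,"hash_restarts":0,"manual_restarts":3}}'
# oom_restarts/hash_restarts above zero mean PHP cleared the cache on its own;
# manual_restarts counts operator-triggered clears.
echo "$status" | jq '.opcache_statistics | {oom_restarts, hash_restarts, manual_restarts}'
```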

Opcache resets: these are the actual cause of the corruption. Most people try to avoid them, but we really can't until we radically change how we deploy our code. The alternative is just going with revalidation and seeing whether or not that causes weird race conditions during deployments. I think it's pretty much guaranteed to, but I think we could get away with not doing cache resets at least on smaller deploys that happen via sync-file.

Protective opcache settings: I don't think we can really use opcache.protect_memory=1, but I'll run some tests to compare performance with and without it. We don't use file-backed opcache at the moment, but I somewhat think it would protect us rather than create issues.

What should we do

  • bump the opcache size and the interned strings buffer size
  • benchmark the safety options listed above
  • Completely disable opcache resets for sync-file
  • Try to run one server without opcache resets, only with invalidation for a week
  • Find a reliable way to detect such memory corruption issues
  • Add to scap a step to verify such detection and act in case things are broken.
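The first bullet maps to two standard INI directives; the numbers below are placeholders for illustration, not the values actually chosen for production:

```ini
; Total shared memory for the opcode cache, in megabytes.
opcache.memory_consumption=1024
; Memory reserved for interned strings, in megabytes; exhausting this buffer
; during a cache refill is one of the reported corruption triggers.
opcache.interned_strings_buffer=96
```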

It's clear to me that a satisfying long-term solution to this problem encompasses reorganizing our code, our configuration, and how we deploy both. Food for thought:
a not-so-novel approach used at Etsy https://codeascraft.com/2013/07/01/atomic-deploys-at-etsy/ (basically, they deploy the code alternately to two directories). This is completely impossible for us with our current deployment practices.
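The Etsy approach mentioned above amounts to keeping two copies of the code and atomically flipping a symlink between them; a minimal illustration (all paths are made up for the demo):

```shell
# Two parallel release directories; the web server's docroot follows "current".
mkdir -p /tmp/deploy-demo/releases/a /tmp/deploy-demo/releases/b
echo "old" > /tmp/deploy-demo/releases/a/version
echo "new" > /tmp/deploy-demo/releases/b/version
ln -sfn /tmp/deploy-demo/releases/a /tmp/deploy-demo/current

# Deploy: sync code into the inactive directory, then swap the symlink
# atomically via rename, so no request ever sees a half-synced tree.
ln -sfn /tmp/deploy-demo/releases/b /tmp/deploy-demo/current.tmp
mv -Tf /tmp/deploy-demo/current.tmp /tmp/deploy-demo/current
cat /tmp/deploy-demo/current/version   # → new
```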

Mentioned in SAL (#wikimedia-operations) [2019-05-30T09:31:34Z] <_joe_> depooling mw1261 for benchmarking for T224491

After even the first, simplest tests, it's absolutely clear to me that with opcache.protect_memory turned on, latencies increase by 200% on all pages. It's absolutely unviable to use in production.

More errors that I presume are due to corruption, since they are impossible given the actual code on disk and limited to one server only:

[{exception_id}] {exception_url} Error from line 2246 of /srv/mediawiki/php-1.34.0-wmf.7/includes/Title.php: Undefined class constant 'RIGOR_SECURE'
Host: mw1321
Timestamp: 2019-05-30 16:08–17:52 (2 hours)

Possibly still ongoing?

Mentioned in SAL (#wikimedia-operations) [2019-05-30T18:11:32Z] <Krinkle> mw1321 php7.2 shows signs of corruption for over 2 hours – https://phabricator.wikimedia.org/T224491#5224464

Mentioned in SAL (#wikimedia-operations) [2019-05-30T18:12:54Z] <Krinkle> Running php7adm /opcache-free on mw1348 and mw1321, T224491

The mw1321 issue seems to have recovered (based on querying host:mw1321 on the mediawiki-new-errors Logstash board).

No Icinga or PyBal alerts fired for mw1321, because the corruption hit a relatively rare code path in MediaWiki's PermissionManager.


The issue with mw1348 (an API server) is potentially related to it having recently been switched to 100% traffic on PHP 7.2. Its corruption resulted in an error that happened to be both fatal in nature and in a popular code path. Thus, it triggered Icinga alerts:

<icinga-wm> PROBLEM - PHP7 rendering on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.003 second response time
<icinga-wm> PROBLEM - HHVM rendering on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.003 second response time 
<icinga-wm> PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.011 second 
<icinga-wm> PROBLEM - Apache HTTP on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.003 second response

(I do wonder why "HHVM rendering" shows up, presumably either an unlikely coincidence, or some level of separation that isn't recognised by the way the request is made.)

These recovered after a manual opcache clear.

@Joe I'm excluding php7.2 from all Logstash monitoring related to MediaWiki for the time being. We just have hundreds and hundreds of unique errors, exceptions, and fatals, all on different servers and for varying periods of time (some for many hours). It's making the monitoring completely unusable, and I'm not going to track each one down to see whether or not it's a real bug rather than corruption.

I'm not sure, but it seems like at any given moment there are at least two or three app servers running corrupt PHP 7.2 code on live traffic. I've restarted a few of them but am giving up for now. It's whack-a-mole :-)

I've added a php-version breakdown to the Logstash dashboards I use most, to quickly see whether a filtered error is specific to php7.2, and also to see how overall volume is divided.

  • mediawiki-errors

This shows all events from type:mediawiki channel:(error OR exception OR fatal). For the below, I amended it to filter out channel:error (php warnings) per [1]. Here, php 7.2 is responsible for 56% of fatal exceptions and fatal errors overall (except for execution timeouts, which the dashboard filters out first).

Screenshot 2019-05-30 at 19.39.23.png (594×644 px, 53 KB)

  • mediawiki-new-errors

This is based on 'mediawiki-errors' but with a lot of filters for known issues tracked at Wikimedia-production-error. Same as above otherwise, but with most of the known issues with our code excluded. Here, php 7.2 is responsible for 62% of fatal exceptions and fatal errors.

Screenshot 2019-05-30 at 19.41.53.png (600×624 px, 52 KB)

Charts based on last 24 hours of data.

I don't know the exact portion of traffic directed at PHP 7.2, but I assume it is still below 50%. We have an A/B test at 20% (which means less than 20% of actual user traffic, given it only applies to traffic from web browsers, and only to secondary views within a browsing session), we have 1 API server at 100%, and some jobs.

So, I am disabling the opcache resets for now, and setting timestamp validation short enough that it will hardly be a problem (but I might be wrong and we'll find out fast).

For the future, we will need to do more atomic releases, and this will probably include doing something as brutal as depooling/restarting/repooling application servers, and changing the way we do deployments.

I will still raise the opcache memory and the interned strings buffer, but I see no reason to tilt at windmills trying to fix something that's clearly broken in quite subtle ways.

Change 513549 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] scap: stop resetting opcache on php7.2

https://gerrit.wikimedia.org/r/513549

Change 513549 merged by Giuseppe Lavagetto:
[operations/puppet@production] scap: stop resetting opcache on php7.2

https://gerrit.wikimedia.org/r/513549

Change 513571 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mw1348: increase opcache revalidation frequency

https://gerrit.wikimedia.org/r/513571

Change 513571 merged by Giuseppe Lavagetto:
[operations/puppet@production] mw1348: increase opcache revalidation frequency

https://gerrit.wikimedia.org/r/513571

FWIW, the percentage of errors coming from php-fpm servers included timeouts, since the timeout message is different there and is emitted by Excimer. Those can be excluded by filtering on exception.class = WMFTimeoutException.
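Expressed as a Lucene-style filter (the same query syntax the dashboards in this task use), the exclusion might look like:

```
type:mediawiki AND channel:(error OR exception OR fatal) AND NOT exception.class:WMFTimeoutException
```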

Even then, the percentage of errors coming from php7 was statistically higher than expected.

Since I disabled opcache resets, let's see what happens over the weekend and the first day with further releases.

Change 513598 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: use 10 seconds as revalidate frequency everywhere

https://gerrit.wikimedia.org/r/513598

Change 513598 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: use 10 seconds as revalidate frequency everywhere

https://gerrit.wikimedia.org/r/513598

Mentioned in SAL (#wikimedia-operations) [2019-05-31T14:20:20Z] <_joe_> rolling restart of php-fpm across production to pick up the shorter revalidate frequency for T224491

@Niedzielski noticed another, pretty similar issue: T225018: [Bug] TypeError in PopupsContext. System fails with PHP Fatal error: Uncaught TypeError: Argument 4 passed to Popups\PopupsContext::__construct() must implement interface Popups\EventLogging\EventLogger, instance of Popups\EventLogging\MWEventLogger given: but the Popups\EventLogging\MWEventLogger implements Popups\EventLogging\EventLogger so most probably it's a broken opcache issue.


No that's completely unrelated to opcache corruption. We're not resetting opcache anymore, even. Those 2 error messages come from one specific server, mw1248, exactly at the time it was being turned on after a reboot. I can imagine some sort of race condition happening there.

Change 514673 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] WIP: mediawiki: check and restart php7 if needed

https://gerrit.wikimedia.org/r/514673

No that's completely unrelated to opcache corruption. […] I can imagine some sort of race condition happening there.

What kind of race condition would lead to PHP loading a class from a file in a way that seems to corrupt or selectively skip the part of the line where the "extends" / "implements" keywords are?

namespace Popups\EventLogging;

class MWEventLogger implements EventLogger {

It appears this was parsed to effectively be class MWEventLogger implements SomethingThatIsNotEventlogger {

Change 514673 abandoned by Effie Mouzeli:
WIP: mediawiki: check and restart php7 if needed

Reason:
We'll go with 519207 instead

https://gerrit.wikimedia.org/r/514673

Joe claimed this task.

The immediate problem seems to be resolved, given we've not seen corruptions since we stopped resetting the opcache on live servers.

I consider the bug resolved because we now have daily cron jobs opportunistically restarting php-fpm if needed, hence the immediate fallout from this problem has been addressed.

Yeah

Does T224857: Enhance MediaWiki deployments for support of php7.x also block that or is this work sufficient?

They are related; due to the corruptions we started T224857 and stalled the PHP7 rollout. Cron restarts are a band-aid until T224857 is resolved, letting us move forward with the rollout.

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:59:49Z] <Krinkle> mw1342 is generating strange PHP errors (php7 only), ref T224491

Mentioned in SAL (#wikimedia-operations) [2019-07-12T01:01:42Z] <Krinkle> mw1342 generated some ~ 11,500 additional PHP errors over a 4 hour period (18:00-22:30 UTC), ref T224491

The cause of these notices was code like this:

function setThing( array $list, $value ) {
	$key = implode( '|', $list );
	$this->data[ $key ] = $value;
}

function getThing( array $list ) {
	$key = implode( '|', $list );
	return $this->data[ $key ];
}

One of the '|' strings was corrupted. Thus, despite the code being fully covered by tests, and despite explicit run-time calls that ensure the key exists, the key could still be missing. In theory, this could have been avoided if the code followed a strict rule of "no magic strings" and "no magic numbers" anywhere (putting all such values in constants). But that's not the norm in our code. And even then, things could fail catastrophically if constant A became corrupted to match constant B, as they'd clash and point at an unrelated (or also undefined) key.

@Krinkle, any chance you can give a link to a few of those errors with their stacktrace in logstash? I'd like to look at a few specific entries.

Frankly without further information on what happened, how it was debugged, and how it was solved, I'm not inclined to consider this problem related to opcache corruption necessarily.

Logstash query for the error in question:

https://logstash.wikimedia.org/goto/8844e41805b83833f5266d4f98fc524e

exception.trace:resourceloader AND "Undefined index"

[ type:mediawiki ] [ channel.raw: "error" ]

Only seen from PHP 7 requests for Special:BlankPage?force_php7=1. The code path is quite deterministic and I'm unable to reproduce this on either HHVM or PHP 7 on demand. That it appears to be an impossible condition to reach, that it doesn't happen on demand, and that it doesn't happen on other servers around the same time, suggests to me that it is likely corruption. It happened on mw1342 on July 12, and again in recent days on mw1276, mw1312, and mw1347.

I agree it could be due to something else. I would say I've not seen other corruptions happening, but I've also stopped actively looking for them because it became a full-time all-consuming job. This is just one I ran into during regular work.

From an IRC conversation, Krinkle says he ran

php7adm /opcache-free

and the problem immediately went away in the specific instance.

@Krinkle I'm suspicious that this can fail on different days with different versions of MediaWiki and on different hosts but in the exact same spot in the code. I would not expect opcache issues to lead to this sort of predictable breakage in the general case. Can we get the values of what MediaWiki thinks are in there for $batchKey, $this->titleInfo and $pageNames at the point where that exception is thrown? Would it be awful to put a try catch around that little stanza, log an entry with those values and then raise?

[…] Only seen from PHP 7 requests for Special:BlankPage?force_php7=1. The code path is quite deterministic and I'm unable to reproduce this on either HHVM or PHP 7 on-demand. […] It happened to mw1342 July 12, and also in recent days again on mw1276, mw1312, and mw1347. […]

Those are monitoring requests, so I guess the problem was reported by icinga as well.

If, as @ArielGlenn said, the opcache was freed and the problem went away, that confirms the problem had to do with opcache corruption. How it happened, though, is hard to determine unless I get more information.

Did this happen around a deployment? What was the opcache status (free bytes etc) at the time of the issue?

Worth noting that we have a known GC error in PHP 7.2, which is also 100% reproducible: T228346: PHP 7.2 garbage collector segfault. Although in our case the GC error led to a segfault, I don't think it's impossible that a GC error would cause corruption.

[…]
Only seen from PHP 7 requests for Special:BlankPage?force_php7=1. The code path is quite deterministic and I'm unable to reproduce this on either HHVM or PHP 7 on-demand. […]

Those are monitoring requests, so I guess the problem was reported by icinga as well.

No, because it did not result in an exception being thrown anywhere. Instead, it led the code down an impossible path with valid-looking but wrong side effects (e.g. storing potentially bad stuff in memcached and in module_deps). Strings changing in memory is not something we could reasonably protect ourselves against. It sometimes causes another condition to break and thus throws some unrelated exception, but I think that's relatively rare.

The corruption that causes entire classes or files to go missing would generally lead to a fatal. Maybe not on a request like BlankPage (which exercises relatively little code), but I would indeed expect such corruption to break enough traffic with an exception that it can be detected by PyBal as an unhealthy backend.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:06 PM

I will close this bug as resolved. We've not seen a recurrence in a long time, and I think we engineered around this issue by doing rolling restarts.