Page MenuHomePhabricator

PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.)
Open, MediumPublic

Description

NOTE: See also tracking task for corruptions in 2019 was T224491.

Diagnosing opcache corruption

Please see https://wikitech.wikimedia.org/wiki/User:CDanis/Diagnosing_opcache_corruption

An old error report for the case that began this tracking task
Error: Call to undefined method Monolog\Logger::get()

from /srv/mediawiki/php-1.35.0-wmf.19/includes/libs/objectcache/wancache/WANObjectCache.php:1303
trace
#0 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthUser.php(558): WANObjectCache->getWithSetCallback(string, integer, Closure, array)
#1 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthUser.php(414): CentralAuthUser->loadFromCache()
#2 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthUser.php(589): CentralAuthUser->loadState()
#3 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthUser.php(686): CentralAuthUser->getId()
#4 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthHooks.php(917): CentralAuthUser->exists()
#5 /srv/mediawiki/php-1.35.0-wmf.19/includes/Hooks.php(174): CentralAuthHooks::onGetUserBlock(User, string, NULL)
#6 /srv/mediawiki/php-1.35.0-wmf.19/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#7 /srv/mediawiki/php-1.35.0-wmf.19/includes/block/BlockManager.php(167): Hooks::run(string, array)
#8 /srv/mediawiki/php-1.35.0-wmf.19/includes/user/User.php(1800): MediaWiki\Block\BlockManager->getUserBlock(User, WebRequest, boolean)
#9 /srv/mediawiki/php-1.35.0-wmf.19/includes/user/User.php(2119): User->getBlockedStatus(boolean)
#10 /srv/mediawiki/php-1.35.0-wmf.19/includes/block/BlockManager.php(470): User->getBlock()
#11 /srv/mediawiki/php-1.35.0-wmf.19/includes/MediaWiki.php(755): MediaWiki\Block\BlockManager->trackBlockWithCookie(User, WebResponse)
#12 /srv/mediawiki/php-1.35.0-wmf.19/includes/api/ApiMain.php(548): MediaWiki::preOutputCommit(DerivativeContext)
#13 /srv/mediawiki/php-1.35.0-wmf.19/includes/api/ApiMain.php(493): ApiMain->executeActionWithErrorHandling()
#14 /srv/mediawiki/php-1.35.0-wmf.19/api.php(84): ApiMain->execute()
#15 /srv/mediawiki/w/api.php(3): require(string)
#16 {main}
  • Request ID: XkWMYwpAIDkAALfuQmcAAACH
  • Host: mw1345
  • Request Method: GET
  • Request URL: commons.wikimedia.org/w/api.php?action=query&generator=allimages&prop=imageinfo&…
  • phpversion: 7.2.26-1+0~20191218.33+debian9~1.gbpb5a340+wmf1
  • timestamp: 2020-02-13T17:51:28+00:00

Logstash record:
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2020.02.13/mediawiki?id=AXA_rVssh3Uj6x1zToLY

Source code

CentralAuthUser.php@1.35.0-wmf.19

		$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
		$data = $cache->getWithSetCallback( # line 540
			$this->getCacheKey( $cache ),
			$cache::TTL_DAY,
			function ( $oldValue, &$ttl, array &$setOpts ) {
				$dbr = CentralAuthUtils::getCentralReplicaDB();
				$setOpts += Database::getCacheSetOptions( $dbr );

				$this->loadFromDatabase();
				$this->loadAttached();
				$this->loadGroups();

				$data = [];
				foreach ( self::$mCacheVars as $var ) {
					$data[$var] = $this->$var;
				}

				return $data;
			},
			[ 'pcTTL' => $cache::TTL_PROC_LONG, 'version' => self::VERSION ] # line 558
		); # line 559

WANObjectCache.php@1.35.0-wmf.19

	final public function getWithSetCallback(
		$key, $ttl, $callback, array $opts = [], array $cbParams = []
	) {
		$version = $opts['version'] ?? null;
		$pcTTL = $opts['pcTTL'] ?? self::TTL_UNCACHEABLE;
		$pCache = ( $pcTTL >= 0 )
			? $this->getProcessCache( $opts['pcGroup'] ?? self::PC_PRIMARY )  # line 1296
			: null;

		if ( $pCache && $this->callbackDepth == 0 ) {
			$cached = $pCache->get( $this->getProcessCacheKey( $key, $version ), $pcTTL, false );   # line 1303
			if ( $cached !== false ) {
				$this->logger->debug( "getWithSetCallback($key): process cache hit" );  # line 1305
				return $cached;
			}
		}
Notes

Stuff is very messed up here:

  1. The error Call to undefined method Monolog\Logger::get() is attribued to line WANObjectCache.php:L1303, which is calling $pCache->get() where $pCache is an instance of MapCacheLRU. It is meant to be calling MapCacheLRU->get().
  2. For some reason the engine is thinking about Monolog\Logger, which is strange. For what it's worth, there is an instance of that object "near by" – two lines down on line 1305. Could it be confusing the two??
  3. The stack trace says WANObjectCache->getWithSetCallback was called from CentralAuthUser.php:L558. This is also odd, because that line is neither the start nor the end of the function call. See snippet above. Looks like this may be off by a few as well.

Might be opcache corruption.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I looked at the code base and couldn't find any instance that EntityLinkFormatterFactory object would be set as property id of an statement. Unless some eval magic is involved and we would have bad lua in clients invoking this but I highly highly doubt that.

Here is another mysterious mis-call (Logstash single document)

  • timestamp: 2020-02-14T10:04:16
  • reqId: XkZwYwpAICkAAAmWLyEAAAAL
  • host: mw1320
PHP Fatal Error: Uncaught Error: Call to a member function isHandling() on integer

in /srv/mediawiki/php-1.35.0-wmf.19/vendor/monolog/monolog/src/Monolog/Logger.php:304
#0 /srv/mediawiki/php-1.35.0-wmf.19/vendor/monolog/monolog/src/Monolog/Logger.php(628): Monolog\Logger->addRecord(100, 'getWithSetCallb...', Array)
#1 /srv/mediawiki/php-1.35.0-wmf.19/includes/libs/objectcache/wancache/WANObjectCache.php(1305): Monolog\Logger->debug('getWithSetCallb...')

The code in question:

WANObjectCache
setLogger( LoggerInterface $logger ) {
	$this->logger = $logger;
}
getWithSetCallback(  ) {
	
	$this->logger->debug( "getWithSetCallback($key): process cache hit" );
Monolog/Logger
public function debug($message, array $context = array()) {
  return $this->addRecord(static::DEBUG, $message, $context);
}
public function addRecord(int $level, string $message, array $context = array()) {
   $this->handlers = [  ]; // @var HandlerInterface[]
    

    $levelName = static::getLevelName($level); // @var string
    reset($this->handlers);
    while ($handler = current($this->handlers)) {
        if ($handler->isHandling(array('level' => $level))) { # BOOM

In this case it looks like PHP has somehow confused $handler (HandlerInterface) with $level (integer).

Here is another mysterious mis-call (Logstash single document)

  • timestamp: 2020-02-14T10:04:16
  • reqId: XkZwYwpAICkAAAmWLyEAAAAL
  • host: mw1320
PHP Fatal Error: Uncaught Error: Call to a member function isHandling() on integer

in /srv/mediawiki/php-1.35.0-wmf.19/vendor/monolog/monolog/src/Monolog/Logger.php:304
#0 /srv/mediawiki/php-1.35.0-wmf.19/vendor/monolog/monolog/src/Monolog/Logger.php(628): Monolog\Logger->addRecord(100, 'getWithSetCallb...', Array)
#1 /srv/mediawiki/php-1.35.0-wmf.19/includes/libs/objectcache/wancache/WANObjectCache.php(1305): Monolog\Logger->debug('getWithSetCallb...')

It does not appear to align with an automatic restart (the next one was several days later), and also does not align with a near-zero amount of opcache memory left. Odd?

Screenshot 2020-04-28 at 17.01.41.png (1×2 px, 237 KB)

Error message
Fatal error:
Cannot declare class Wikimedia\MWConfig\XWikimediaDebug, because the name is already in use
Notes

A spike of over 7000 fatals in the span of one hour, on the mw1379 server.

It lasted approximately from 2020-06-01 14:00 to 14:50 UTC.

From Logstash mediawiki-errors:

Screenshot 2020-06-02 at 02.00.58.png (598×1 px, 62 KB)

Impact

All wiki web requests on this server were presumably aborted with a system error.

Copying here so that this task remains complete for analysis of the problem, separate from T253673 which is about fixing the root cause. (Controlling the noise)

Today at 2020-07-08T14:00:12, with no deployments happening, mw1346 started generating exceptions at high rate:

/srv/mediawiki/php-1.35.0-wmf.39/includes/config/GlobalVarConfig.php:53 GlobalVarConfig::get: undefined option: 'MinervaOverflowInPageActhons'

#0 /srv/mediawiki/php-1.35.0-wmf.39/skins/MinervaNeue/includes/MinervaHooks.php(131): GlobalVarConfig->get(string)
#1 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(320): MinervaHooks::onMobileFrontendFeaturesRegistration(MobileFrontend\Features\FeaturesManager)
#2 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(131): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#3 /srv/mediawiki/php-1.35.0-wmf.39/extensions/MobileFrontend/includes/Features/FeaturesManager.php(43): MediaWiki\HookContainer\HookContainer->run(string, array)
#4 /srv/mediawiki/php-1.35.0-wmf.39/extensions/MobileFrontend/includes/ServiceWiring.php(55): MobileFrontend\Features\FeaturesManager->useHookToRegisterExtensionOrSkinFeatures()
#5 /srv/mediawiki/php-1.35.0-wmf.39/includes/libs/services/ServiceContainer.php(451): Wikimedia\Services\ServiceContainer->{closure}(MediaWiki\MediaWikiServices)
#6 /srv/mediawiki/php-1.35.0-wmf.39/includes/libs/services/ServiceContainer.php(419): Wikimedia\Services\ServiceContainer->createService(string)
#7 /srv/mediawiki/php-1.35.0-wmf.39/extensions/MobileFrontend/includes/MobileFrontendHooks.php(1115): Wikimedia\Services\ServiceContainer->getService(string)
#8 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(320): MobileFrontendHooks::onMakeGlobalVariablesScript(array, OutputPage)
#9 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(131): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#10 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookRunner.php(2516): MediaWiki\HookContainer\HookContainer->run(string, array)
#11 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(3386): MediaWiki\HookContainer\HookRunner->onMakeGlobalVariablesScript(array, OutputPage)
#12 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(3035): OutputPage->getJSVars()
#13 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(3056): OutputPage->getRlClient()
#14 /srv/mediawiki/php-1.35.0-wmf.39/includes/skins/SkinMustache.php(82): OutputPage->headElement(SkinApi)
#15 /srv/mediawiki/php-1.35.0-wmf.39/includes/skins/SkinMustache.php(57): SkinMustache->getTemplateData()
#16 /srv/mediawiki/php-1.35.0-wmf.39/includes/skins/SkinTemplate.php(141): SkinMustache->generateHTML()
#17 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(2616): SkinTemplate->outputPage()
#18 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiFormatBase.php(333): OutputPage->output()
#19 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiFormatRaw.php(82): ApiFormatBase->closePrinter()
#20 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(1834): ApiFormatRaw->closePrinter()
#21 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(608): ApiMain->printResult(integer)
#22 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(532): ApiMain->handleException(ConfigException)
#23 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(496): ApiMain->executeActionWithErrorHandling()
#24 /srv/mediawiki/php-1.35.0-wmf.39/api.php(89): ApiMain->execute()
#25 /srv/mediawiki/php-1.35.0-wmf.39/api.php(44): wfApiMain()
#26 /srv/mediawiki/w/api.php(3): require(string)
#27 {main}

Note the spelling of MinervaOverflowInPageActhons.

Pybal quickly depooled the host, but probably monitoring kept querying the server and generating the exceptions.

@Joe did php7adm /opcache-free on mw1346 and the exceptions cleared.

Interestingly, according to the opcache metadata, the file where the error was (/srv/mediawiki/php-1.35.0-wmf.39/skins/MinervaNeue/includes/MinervaHooks.php) was in opcache since 2 weeks, which isn't great because it means nothing really caused this issue:

  • no deploy
  • no opcache invalidation by chance
  • no php restarts

Happened twice recently on

host: wtp1040
timestamp: 2020-07-08T18:06:14
reqid: 1e314c17-ae0b-4826-9dad-ff4177672966

[1e314c17-ae0b-4826-9dad-ff4177672966] PHP Fatal error: Uncaught Error: Call to undefined method Monolog\Logger::hasField() in /srv/mediawiki/php-1.35.0-wmf.39/includes/cache/MessageCache.php:629 in /srv/mediawiki/php-1.35.0-wmf.39/includes/cache/MessageCache.php on line 629

exception.trace
#0 /srv/mediawiki/php-1.35.0-wmf.39/includes/cache/MessageCache.php(284): MessageCache->isLanguageLoaded('en')
#1 /srv/mediawiki/php-1.35.0-wmf.39/includes/cache/MessageCache.php(1091): MessageCache->load('en')
#2 /srv/mediawiki/php-1.35.0-wmf.39/includes/cache/MessageCache.php(1018): MessageCache->getMsgFromNamespace('Internalerror', 'en')
#3 /srv/mediawiki/php-1.35.0-wmf.39/includes/cache/MessageCache.php(988): MessageCache->getMessageForLang(Object(LanguageEn), 'internalerror', true, Array)
#4 /srv/mediawiki/php-1.35.0-wmf.39/includes/cache/MessageCache.php(930): MessageCache->getMessageFromFallbackChain(Object(LanguageEn), 'internalerror', true)
#5 /srv/mediawiki/php-1.35.0-wmf.39/includes/language/Message.php(1302): MessageCache->get('internalerror', true, Object(LanguageEn))
#6 /srv/mediawiki/php-1.35.0-wmf.39/includes/language/Message.php(860): Message
Krinkle renamed this task from PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) to PHP7 corruptions (Call on wrong object, Call to undefined method, etc.).Aug 29 2020, 6:33 PM

Another corruption:

Uncaught Psr\Log\InvalidArgumentException: Level "400" is not defined, use one of: 0, 1, 2 in /srv/mediawiki/php-1.36.0-wmf.2/vendor/monolog/monolog/src/Monolog/Logger.php:516

#0 /srv/mediawiki/php-1.36.0-wmf.2/vendor/monolog/monolog/src/Monolog/Logger.php(298): Monolog\Logger::getLevelName(400)
#1 /srv/mediawiki/php-1.36.0-wmf.2/vendor/monolog/monolog/src/Monolog/Logger.php(712): Monolog\Logger->addRecord(400, '[{exception_id}...', Array)
#2 /srv/mediawiki/php-1.36.0-wmf.2/includes/exception/MWExceptionHandler.php(346): Monolog\Logger->error('[{exception_id}...', Array)
#3 [internal function]: MWExceptionHandler::handleFatalError()
#4 {main}
  • Request ID: 56ec2797-9aa6-4203-9fd7-9ab1bbc04078
  • Request URL: https://de.wikibooks.org/w/index.php?title=…&diff=…&oldid=0
  • Timestamp: 2020-07-30T11:04
  • Host: mw1275 (appserver)
  • phpversion: 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1

The level 400 very clearly is defined beyond resonable doubt (vendor src).

Krinkle renamed this task from PHP7 corruptions (Call on wrong object, Call to undefined method, etc.) to PHP7 corruption reports in 2020 (Call on wrong object, etc.).Sep 11 2020, 6:19 PM
Krinkle updated the task description. (Show Details)
  • timestamp: 2020-10-01T16:15:49
  • host: mw2290
  • message:
[ab3d143d-d371-41e4-ab74-ae92255e704f] /w/api.php?action=query&prop=revisions&revids=981318678&rvprop=content&rvparse=1&format=xml   Error from line 1369 of /srv/mediawiki/php-1.36.0-wmf.10/includes/libs/objectcache/wancache/WANObjectCache.php: Call to undefined method Monolog\Logger::get()

We've seen this exact one before at the start of this task. The fact that it tends to gravitate toward methods of this name, or this class, or in this file could be something. The file has changed numerous times since, so it's not likely to only be dependent on random chance or the file leading to a specific hash, but rather something with the specific sequence of statements in this file...

As an additional data point, note that in the merged T264241: Class 'LathMathML' not found task from yesterday, it was a different class / file.

Yeah, that's a classic string one-byte flip issue which does seem to be random and rarely seen twice the same way.

Some upstream bugs for the record from the old task, to be copied into the task description:

  • timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00
  • host: mw2252
  • message:

[2b171d8b-48ec-480d-b7a4-187dd3af259c] /w/api.php?titles=Image%3ANorrlands_nation_Nya_entr%C3%A9n2.jpg&iiprop=url&iiurlwidth=120&iiurlheight=120&prop=imageinfo&format=json&action=query TypeError from line 395 of /srv/mediawiki/php-1.36.0-wmf.13/extensions/WikibaseCirrusSearch/src/Hooks.php: Return value of Wikibase\Search\Elastic\Hooks::getWBCSConfig() must be an instance of Vikibase\Search\Elastic\WikibaseSearchConfig, instance of Vikibase\Search\Elastic\WikibaseSearchConfig returned

must be an instance of Vikibase\Search\ [...]

Is this some sort of copying glitch?

must be an instance of Vikibase\Search\ [...]

Is this some sort of copying glitch?

No this is the actual corruption.

must be an instance of Vikibase\Search\ [...]

Is this some sort of copying glitch?

No this is the actual corruption.

Thanks. The "string one-byte flip" - which Krinkle mentioned earlier - has become self-aware and is punning us.

  • timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00
  • host: mw2252

Please note: this was noticed pretty early but the server was not depooled by mistake.

Noting for the record in all cases where there's a bogus changed character in a string, the bad character is always one less than what it should be (i.e., it's not a bit flip, it's a decrement).

That happened during 1.36.0-wmf.25 promotion to testwiki. We then had three servers showing all the symptoms of suffering from an opcache corruption which got cleared by restarting php fpm.

The UTC timeline is:

  • 18:59 scap finished
  • 19:34 - 19:47 mw1362 throws several kind of exceptions at a high rate:
    • Elastica\Exception\InvalidException from line 44 of /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument
    • Error from line 110 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/lib/includes/Formatters/OutputFormatValueFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found
    • Error from line 119 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/client/includes/DataAccess/DataAccessSnakFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found

The first one does not make sense code wise. The second is a class name had a bit flip leading to a FE transposition (that is a known symptom as well).

  • 19:41 mw1362: apache restarted
  • 19:47 mw1362 depooled (clears the log spam since it is no more receiving traffic)
  • 19:54 - 20:20 mw1344 throws: Error from line 599 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/CirrusSearch/includes/Searcher.php: Call to undefined method CirrusSearch\Searcher::getQueryCacheStatsKey()
  • 20:04 mw1344: apache restarted
  • 20:20 mw1344: /usr/local/sbin/restart-php7.2-fpm (that clears the issue)
  • 20:28 mw1362 repooled
  • 20:28 - 20:38 mw1362 throws: Elastica\Exception\InvalidException from line 44 of /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument
  • 20:38 mw1362: /usr/local/sbin/restart-php7.2-fpm (that clears the issue)

mw1362 started misbehaving 35 minutes AFTER scap has completed. It is probably related, but such delay is unknown to me at least.


A theory we had over IRC:

We had a puppet change merged at 19:19 which drops ServerAdmin webmaster@wikimedia.org from the Apache config: caa6f65dd48a12b38d5ae7b63091f9662e4c3ef4 . https://gerrit.wikimedia.org/r/c/operations/puppet/+/651649 . That is the sole known change that happened just before mw1362 started misbehaving.

As I understand it the Apache config change is rolled via Puppet which randomly run accross the fleet, that then causes Apache2 to reload. That might explain why mw1344 started misbehaving 20 minutes later (19:54). I looked at /etc/cron.d/puppet on those servers to compare it with the start of the issue:

ServerIssueCron
mw136219:3419,49
mw134419:5410,40

So maybe it is not related. I don't have access to the Apache/Puppet/Systemd logs to identify whether that is related.

Else well that is unknown to me, do we ever had an opcache corruption randomly appearing without a code deployment?

Do we need to update the title / create a 2021 task?

Krinkle renamed this task from PHP7 corruption reports in 2020 (Call on wrong object, etc.) to PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.).Jan 5 2021, 9:44 PM

That happened during 1.36.0-wmf.25 promotion to testwiki. We then had three servers showing all the symptoms of suffering from an opcache corruption which got cleared by restarting php fpm.

The UTC timeline is:

  • 18:59 scap finished
  • 19:34 - 19:47 mw1362 throws several kind of exceptions at a high rate:
    • Elastica\Exception\InvalidException from line 44 of /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument
    • Error from line 110 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/lib/includes/Formatters/OutputFormatValueFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found
    • Error from line 119 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/client/includes/DataAccess/DataAccessSnakFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found

The first one does not make sense code wise. The second is a class name had a bit flip leading to a FE transposition (that is a known symptom as well).

  • 19:41 mw1362: apache restarted
  • 19:47 mw1362 depooled (clears the log spam since it is no more receiving traffic)
  • 19:54 - 20:20 mw1344 throws: Error from line 599 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/CirrusSearch/includes/Searcher.php: Call to undefined method CirrusSearch\Searcher::getQueryCacheStatsKey()
  • 20:04 mw1344: apache restarted
  • 20:20 mw1344: /usr/local/sbin/restart-php7.2-fpm (that clears the issue)

Regarding 1362 and 1344, both servers reached their max cached keys *way after* scap. From the graphs it seems that opcache was cleared and the problems started. The number of keys during the scap run was below the threshold (<30k keys), so php-fpm was not restarted.

image.png (1×3 px, 223 KB)
mw1362

image.png (1×3 px, 239 KB)
mw1344

  • 20:28 mw1362 repooled
  • 20:28 - 20:38 mw1362 throws: Elastica\Exception\InvalidException from line 44 of /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument
  • 20:38 mw1362: /usr/local/sbin/restart-php7.2-fpm (that clears the issue)

I don't know if this issue is related or not, if php-fpm on mw1362 was not restarted while the host was depooled, then it probably is.

To my understanding, issues arise when opcache reaches its max cached keys and restarts itself, and it is only fixable with a php-fpm restart. We can lower the threshold even more or/and maybe increase the number of max cached keys. It appears to be the case in the last opcache corruptions (this occurrence + the ones back in October), so I suspect those are the conditions under which this problem surfaces. Lastly, I think this issue has been happening only to API servers, but I will have to verify this.

Related: T253673#6569013, T264362

@jijiki thanks for the investigation! We were kind of wondering whether the Apache reload might have triggered the opcache issue which is ruled out. Thanks !

I'm seeing this several times a week and have for several months. I haven't reported it before since it's not essential prod code and we've been focussing on mitigation rather than further gathering/investigation, but figured I might as well report it here since other people keep finding it and asking if there's a task for it:

PHP Fatal error: Cannot declare class Wikimedia\MWConfig\XWikimediaDebug, because the name is already in use in /srv/mediawiki/src/XWikimediaDebug.php on line 16

reqId: "d4747ce3-deae-4ac7-bcc9-6de99b4adb24"
host: "wtp1039"
timestamp: "2021-01-10T10:33:31.480676+00:00"

affects both mw* and wpt*.

Change 657398 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki: reduce the number of cached keys that trigger a restart

https://gerrit.wikimedia.org/r/657398

Change 657398 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki: reduce the number of cached keys that trigger a restart

https://gerrit.wikimedia.org/r/657398

@Krinkle: Robert doesn't seem to have been active for a few years nor involved in this task. Did you mean to assign it to him?

I'm admin and I can't disable it....

I'm admin and I can't disable it....

Phabricator is good at that. I guess it'll have to be CLI. Can't see at a glance any other things that trigger messed up via https://phabricator.wikimedia.org/maniphest/query/fA5RcoBdI.Po/#R

I'm admin and I can't disable it....

You should ask the members of the "phabricator-admin" shell admin group. (aklapper, gjg). They have sudo to run various commands to delete (aka. destroy) things via command line. That is higher level admin than web UI admin.