Page MenuHomePhabricator

PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.)
Closed, ResolvedPublic

Description

NOTE: See also tracking task for corruptions in 2019 was T224491.

Diagnosing opcache corruption

Please see https://wikitech.wikimedia.org/wiki/User:CDanis/Diagnosing_opcache_corruption

An old error report for the case that began this tracking task
Error: Call to undefined method Monolog\Logger::get()

from /srv/mediawiki/php-1.35.0-wmf.19/includes/libs/objectcache/wancache/WANObjectCache.php:1303
trace
#0 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthUser.php(558): WANObjectCache->getWithSetCallback(string, integer, Closure, array)
#1 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthUser.php(414): CentralAuthUser->loadFromCache()
#2 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthUser.php(589): CentralAuthUser->loadState()
#3 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthUser.php(686): CentralAuthUser->getId()
#4 /srv/mediawiki/php-1.35.0-wmf.19/extensions/CentralAuth/includes/CentralAuthHooks.php(917): CentralAuthUser->exists()
#5 /srv/mediawiki/php-1.35.0-wmf.19/includes/Hooks.php(174): CentralAuthHooks::onGetUserBlock(User, string, NULL)
#6 /srv/mediawiki/php-1.35.0-wmf.19/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#7 /srv/mediawiki/php-1.35.0-wmf.19/includes/block/BlockManager.php(167): Hooks::run(string, array)
#8 /srv/mediawiki/php-1.35.0-wmf.19/includes/user/User.php(1800): MediaWiki\Block\BlockManager->getUserBlock(User, WebRequest, boolean)
#9 /srv/mediawiki/php-1.35.0-wmf.19/includes/user/User.php(2119): User->getBlockedStatus(boolean)
#10 /srv/mediawiki/php-1.35.0-wmf.19/includes/block/BlockManager.php(470): User->getBlock()
#11 /srv/mediawiki/php-1.35.0-wmf.19/includes/MediaWiki.php(755): MediaWiki\Block\BlockManager->trackBlockWithCookie(User, WebResponse)
#12 /srv/mediawiki/php-1.35.0-wmf.19/includes/api/ApiMain.php(548): MediaWiki::preOutputCommit(DerivativeContext)
#13 /srv/mediawiki/php-1.35.0-wmf.19/includes/api/ApiMain.php(493): ApiMain->executeActionWithErrorHandling()
#14 /srv/mediawiki/php-1.35.0-wmf.19/api.php(84): ApiMain->execute()
#15 /srv/mediawiki/w/api.php(3): require(string)
#16 {main}
  • Request ID: XkWMYwpAIDkAALfuQmcAAACH
  • Host: mw1345
  • Request Method: GET
  • Request URL: commons.wikimedia.org/w/api.php?action=query&generator=allimages&prop=imageinfo&…
  • phpversion: 7.2.26-1+0~20191218.33+debian9~1.gbpb5a340+wmf1
  • timestamp: 2020-02-13T17:51:28+00:00

Logstash record:
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2020.02.13/mediawiki?id=AXA_rVssh3Uj6x1zToLY

Source code

CentralAuthUser.php@1.35.0-wmf.19

		$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
		$data = $cache->getWithSetCallback( # line 540
			$this->getCacheKey( $cache ),
			$cache::TTL_DAY,
			function ( $oldValue, &$ttl, array &$setOpts ) {
				$dbr = CentralAuthUtils::getCentralReplicaDB();
				$setOpts += Database::getCacheSetOptions( $dbr );

				$this->loadFromDatabase();
				$this->loadAttached();
				$this->loadGroups();

				$data = [];
				foreach ( self::$mCacheVars as $var ) {
					$data[$var] = $this->$var;
				}

				return $data;
			},
			[ 'pcTTL' => $cache::TTL_PROC_LONG, 'version' => self::VERSION ] # line 558
		); # line 559

WANObjectCache.php@1.35.0-wmf.19

	final public function getWithSetCallback(
		$key, $ttl, $callback, array $opts = [], array $cbParams = []
	) {
		$version = $opts['version'] ?? null;
		$pcTTL = $opts['pcTTL'] ?? self::TTL_UNCACHEABLE;
		$pCache = ( $pcTTL >= 0 )
			? $this->getProcessCache( $opts['pcGroup'] ?? self::PC_PRIMARY )  # line 1296
			: null;

		if ( $pCache && $this->callbackDepth == 0 ) {
			$cached = $pCache->get( $this->getProcessCacheKey( $key, $version ), $pcTTL, false );   # line 1303
			if ( $cached !== false ) {
				$this->logger->debug( "getWithSetCallback($key): process cache hit" );  # line 1305
				return $cached;
			}
		}
Notes

Stuff is very messed up here:

  1. The error Call to undefined method Monolog\Logger::get() is attribued to line WANObjectCache.php:L1303, which is calling $pCache->get() where $pCache is an instance of MapCacheLRU. It is meant to be calling MapCacheLRU->get().
  2. For some reason the engine is thinking about Monolog\Logger, which is strange. For what it's worth, there is an instance of that object "near by" – two lines down on line 1305. Could it be confusing the two??
  3. The stack trace says WANObjectCache->getWithSetCallback was called from CentralAuthUser.php:L558. This is also odd, because that line is neither the start nor the end of the function call. See snippet above. Looks like this may be off by a few as well.

Might be opcache corruption.

Related Objects

StatusSubtypeAssignedTask
ResolvedKrinkle
OpenNone
Resolvedjijiki
OpenJoe
Resolveddancy
Resolvedthcipriani
Resolveddancy
Resolveddancy
OpenKrinkle
DuplicatePRODUCTION ERRORdancy
DuplicatePRODUCTION ERROR LarsWirzenius
DuplicatePRODUCTION ERRORCDanis
ResolvedPRODUCTION ERRORUrbanecm
DeclinedPRODUCTION ERRORNone
DuplicatePRODUCTION ERRORJoe

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

As an additional data point, note that in the merged T264241: Class 'LathMathML' not found task from yesterday, it was a different class / file.

Yeah, that's a classic string one-byte flip issue which does seem to be random and rarely seen twice the same way.

Some upstream bugs for the record from the old task, to be copied into the task description:

  • timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00
  • host: mw2252
  • message:

[2b171d8b-48ec-480d-b7a4-187dd3af259c] /w/api.php?titles=Image%3ANorrlands_nation_Nya_entr%C3%A9n2.jpg&iiprop=url&iiurlwidth=120&iiurlheight=120&prop=imageinfo&format=json&action=query TypeError from line 395 of /srv/mediawiki/php-1.36.0-wmf.13/extensions/WikibaseCirrusSearch/src/Hooks.php: Return value of Wikibase\Search\Elastic\Hooks::getWBCSConfig() must be an instance of Vikibase\Search\Elastic\WikibaseSearchConfig, instance of Vikibase\Search\Elastic\WikibaseSearchConfig returned

must be an instance of Vikibase\Search\ [...]

Is this some sort of copying glitch?

must be an instance of Vikibase\Search\ [...]

Is this some sort of copying glitch?

No this is the actual corruption.

must be an instance of Vikibase\Search\ [...]

Is this some sort of copying glitch?

No this is the actual corruption.

Thanks. The "string one-byte flip" - which Krinkle mentioned earlier - has become self-aware and is punning us.

  • timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00
  • host: mw2252

Please note: this was noticed pretty early but the server was not depooled by mistake.

Noting for the record in all cases where there's a bogus changed character in a string, the bad character is always one less than what it should be (i.e., it's not a bit flip, it's a decrement).

That happened during 1.36.0-wmf.25 promotion to testwiki. We then had three servers showing all the symptoms of suffering from an opcache corruption which got cleared by restarting php fpm.

The UTC timeline is:

  • 18:59 scap finished
  • 19:34 - 19:47 mw1362 throws several kind of exceptions at a high rate:
    • Elastica\Exception\InvalidException from line 44 of /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument
    • Error from line 110 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/lib/includes/Formatters/OutputFormatValueFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found
    • Error from line 119 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/client/includes/DataAccess/DataAccessSnakFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found

The first one does not make sense code wise. The second is a class name had a bit flip leading to a FE transposition (that is a known symptom as well).

  • 19:41 mw1362: apache restarted
  • 19:47 mw1362 depooled (clears the log spam since it is no more receiving traffic)
  • 19:54 - 20:20 mw1344 throws: Error from line 599 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/CirrusSearch/includes/Searcher.php: Call to undefined method CirrusSearch\Searcher::getQueryCacheStatsKey()
  • 20:04 mw1344: apache restarted
  • 20:20 mw1344: /usr/local/sbin/restart-php7.2-fpm (that clears the issue)
  • 20:28 mw1362 repooled
  • 20:28 - 20:38 mw1362 throws: Elastica\Exception\InvalidException from line 44 of /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument
  • 20:38 mw1362: /usr/local/sbin/restart-php7.2-fpm (that clears the issue)

mw1362 started misbehaving 35 minutes AFTER scap has completed. It is probably related, but such delay is unknown to me at least.


A theory we had over IRC:

We had a puppet change merged at 19:19 which drops ServerAdmin webmaster@wikimedia.org from the Apache config: caa6f65dd48a12b38d5ae7b63091f9662e4c3ef4 . https://gerrit.wikimedia.org/r/c/operations/puppet/+/651649 . That is the sole known change that happened just before mw1362 started misbehaving.

As I understand it the Apache config change is rolled via Puppet which randomly run accross the fleet, that then causes Apache2 to reload. That might explain why mw1344 started misbehaving 20 minutes later (19:54). I looked at /etc/cron.d/puppet on those servers to compare it with the start of the issue:

ServerIssueCron
mw136219:3419,49
mw134419:5410,40

So maybe it is not related. I don't have access to the Apache/Puppet/Systemd logs to identify whether that is related.

Else well that is unknown to me, do we ever had an opcache corruption randomly appearing without a code deployment?

Do we need to update the title / create a 2021 task?

Krinkle renamed this task from PHP7 corruption reports in 2020 (Call on wrong object, etc.) to PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.).Jan 5 2021, 9:44 PM

That happened during 1.36.0-wmf.25 promotion to testwiki. We then had three servers showing all the symptoms of suffering from an opcache corruption which got cleared by restarting php fpm.

The UTC timeline is:

  • 18:59 scap finished
  • 19:34 - 19:47 mw1362 throws several kind of exceptions at a high rate:
    • Elastica\Exception\InvalidException from line 44 of /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument
    • Error from line 110 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/lib/includes/Formatters/OutputFormatValueFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found
    • Error from line 119 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/client/includes/DataAccess/DataAccessSnakFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found

The first one does not make sense code wise. The second is a class name had a bit flip leading to a FE transposition (that is a known symptom as well).

  • 19:41 mw1362: apache restarted
  • 19:47 mw1362 depooled (clears the log spam since it is no more receiving traffic)
  • 19:54 - 20:20 mw1344 throws: Error from line 599 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/CirrusSearch/includes/Searcher.php: Call to undefined method CirrusSearch\Searcher::getQueryCacheStatsKey()
  • 20:04 mw1344: apache restarted
  • 20:20 mw1344: /usr/local/sbin/restart-php7.2-fpm (that clears the issue)

Regarding 1362 and 1344, both servers reached their max cached keys *way after* scap. From the graphs it seems that opcache was cleared and the problems started. The number of keys during the scap run was below the threshold (<30k keys), so php-fpm was not restarted.

image.png (1×3 px, 223 KB)
mw1362

image.png (1×3 px, 239 KB)
mw1344

  • 20:28 mw1362 repooled
  • 20:28 - 20:38 mw1362 throws: Elastica\Exception\InvalidException from line 44 of /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument
  • 20:38 mw1362: /usr/local/sbin/restart-php7.2-fpm (that clears the issue)

I don't know if this issue is related or not, if php-fpm on mw1362 was not restarted while the host was depooled, then it probably is.

To my understanding, issues arise when opcache reaches its max cached keys and restarts itself, and it is only fixable with a php-fpm restart. We can lower the threshold even more or/and maybe increase the number of max cached keys. It appears to be the case in the last opcache corruptions (this occurrence + the ones back in October), so I suspect those are the conditions under which this problem surfaces. Lastly, I think this issue has been happening only to API servers, but I will have to verify this.

Related: T253673#6569013, T264362

@jijiki thanks for the investigation! We were kind of wondering whether the Apache reload might have triggered the opcache issue which is ruled out. Thanks !

I'm seeing this several times a week and have for several months. I haven't reported it before since it's not essential prod code and we've been focussing on mitigation rather than further gathering/investigation, but figured I might as well report it here since other people keep finding it and asking if there's a task for it:

PHP Fatal error: Cannot declare class Wikimedia\MWConfig\XWikimediaDebug, because the name is already in use in /srv/mediawiki/src/XWikimediaDebug.php on line 16

reqId: "d4747ce3-deae-4ac7-bcc9-6de99b4adb24"
host: "wtp1039"
timestamp: "2021-01-10T10:33:31.480676+00:00"

affects both mw* and wpt*.

Change 657398 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki: reduce the number of cached keys that trigger a restart

https://gerrit.wikimedia.org/r/657398

Change 657398 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki: reduce the number of cached keys that trigger a restart

https://gerrit.wikimedia.org/r/657398

@Krinkle: Robert doesn't seem to have been active for a few years nor involved in this task. Did you mean to assign it to him?

I'm admin and I can't disable it....

I'm admin and I can't disable it....

Phabricator is good at that. I guess it'll have to be CLI. Can't see at a glance any other things that trigger messed up via https://phabricator.wikimedia.org/maniphest/query/fA5RcoBdI.Po/#R

I'm admin and I can't disable it....

You should ask the members of the "phabricator-admin" shell admin group. (aklapper, gjg). They have sudo to run various commands to delete (aka. destroy) things via command line. That is higher level admin than web UI admin.

Krinkle renamed this task from PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) to PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.).Jan 11 2022, 11:37 PM
Krinkle added a subscriber: cjming.

From T297316: Error: Call to undefined method DOMDocument::getMulti()

from /srv/mediawiki/php-1.38.0-wmf.9/includes/libs/objectcache/MemcachedPeclBagOStuff.php(341)
#0 /srv/mediawiki/php-1.38.0-wmf.9/includes/libs/objectcache/MediumSpecificBagOStuff.php(624): MemcachedPeclBagOStuff->doGetMulti(array, integer)
#1 /srv/mediawiki/php-1.38.0-wmf.9/includes/libs/objectcache/wancache/WANObjectCache.php(597): MediumSpecificBagOStuff->getMulti(array)
#2 /srv/mediawiki/php-1.38.0-wmf.9/includes/libs/objectcache/wancache/WANObjectCache.php(1571): WANObjectCache->fetchKeys(array, array, NULL)
#3 /srv/mediawiki/php-1.38.0-wmf.9/includes/libs/objectcache/wancache/WANObjectCache.php(1518): WANObjectCache->fetchOrRegenerate(string, integer, Closure, array, array)
#4 /srv/mediawiki/php-1.38.0-wmf.9/extensions/SyntaxHighlight_GeSHi/includes/SyntaxHighlight.php(273): WANObjectCache->getWithSetCallback(string, integer, Closure)
#5 /srv/mediawiki/php-1.38.0-wmf.9/extensions/SyntaxHighlight_GeSHi/includes/SyntaxHighlight.php(315): SyntaxHighlight::highlightInner(string, string, array)
#6 /srv/mediawiki/php-1.38.0-wmf.9/extensions/SyntaxHighlight_GeSHi/includes/SyntaxHighlight.php(542): SyntaxHighlight::highlight(string, string)
#7 /srv/mediawiki/php-1.38.0-wmf.9/includes/HookContainer/HookContainer.php(338): SyntaxHighlight::onApiFormatHighlight(DerivativeContext, string, string, string)
#8 /srv/mediawiki/php-1.38.0-wmf.9/includes/HookContainer/HookContainer.php(137): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#9 /srv/mediawiki/php-1.38.0-wmf.9/includes/api/ApiHookRunner.php(94): MediaWiki\HookContainer\HookContainer->run(string, array)
#10 /srv/mediawiki/php-1.38.0-wmf.9/includes/api/ApiFormatBase.php(296): MediaWiki\Api\ApiHookRunner->onApiFormatHighlight(DerivativeContext, string, string, string)
#11 /srv/mediawiki/php-1.38.0-wmf.9/includes/api/ApiMain.php(2149): ApiFormatBase->closePrinter()
#12 /srv/mediawiki/php-1.38.0-wmf.9/includes/api/ApiMain.php(950): ApiMain->printResult(integer)
#13 /srv/mediawiki/php-1.38.0-wmf.9/includes/api/ApiMain.php(874): ApiMain->handleException(Wikimedia\RequestTimeout\RequestTimeoutException)
#14 /srv/mediawiki/php-1.38.0-wmf.9/includes/api/ApiMain.php(838): ApiMain->executeActionWithErrorHandling()
#15 /srv/mediawiki/php-1.38.0-wmf.9/api.php(90): ApiMain->execute()
#16 /srv/mediawiki/php-1.38.0-wmf.9/api.php(45): wfApiMain()
Krinkle claimed this task.

Any remaining "smells like opcache" problems we see can't be the cause of php-opcache revalidation mode since that mode is now disabled on production web servers as per T266055 and T253673.