Page MenuHomePhabricator

PHP Warning: Memcached::getMulti(): Memcached constructor was not called
Closed, ResolvedPublicPRODUCTION ERROR

Description

normalized_message
PHP Warning: Memcached::getMulti(): Memcached constructor was not called

Also:

PHP Warning: Memcached::add(): Memcached constructor was not called
exception.trace
from /srv/mediawiki/php-1.36.0-wmf.31/includes/libs/objectcache/MemcachedPeclBagOStuff.php(342)
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.36.0-wmf.31/includes/libs/objectcache/MemcachedPeclBagOStuff.php(342): Memcached->getMulti(array)
#2 /srv/mediawiki/php-1.36.0-wmf.31/includes/libs/objectcache/MediumSpecificBagOStuff.php(566): MemcachedPeclBagOStuff->doGetMulti(array, integer)
#3 /srv/mediawiki/php-1.36.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(508): MediumSpecificBagOStuff->getMulti(array)
#4 /srv/mediawiki/php-1.36.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(428): WANObjectCache->getMulti(array, array, array, integer)
#5 /srv/mediawiki/php-1.36.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1447): WANObjectCache->get(string, integer, array, integer)
#6 /srv/mediawiki/php-1.36.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1392): WANObjectCache->fetchOrRegenerate(string, integer, Closure, array, array)
#7 /srv/mediawiki/php-1.36.0-wmf.31/includes/resourceloader/ResourceLoaderWikiModule.php(508): WANObjectCache->getWithSetCallback(string, integer, Closure, array)
#8 /srv/mediawiki/php-1.36.0-wmf.31/includes/OutputPage.php(3014): ResourceLoaderWikiModule::preloadTitleInfo(ResourceLoaderContext, Wikimedia\Rdbms\MaintainableDBConnRef, array)
#9 /srv/mediawiki/php-1.36.0-wmf.31/includes/OutputPage.php(3072): OutputPage->getRlClient()
#10 /srv/mediawiki/php-1.36.0-wmf.31/includes/skins/SkinMustache.php(141): OutputPage->headElement(SkinVector)
#11 /srv/mediawiki/php-1.36.0-wmf.31/includes/skins/SkinTemplate.php(146): SkinMustache->generateHTML()
#12 /srv/mediawiki/php-1.36.0-wmf.31/includes/OutputPage.php(2631): SkinTemplate->outputPage()
#13 /srv/mediawiki/php-1.36.0-wmf.31/includes/exception/MWExceptionRenderer.php(147): OutputPage->output()
#14 /srv/mediawiki/php-1.36.0-wmf.31/includes/exception/MWExceptionRenderer.php(66): MWExceptionRenderer::reportHTML(WMFTimeoutException)
#15 /srv/mediawiki/php-1.36.0-wmf.31/includes/exception/MWExceptionHandler.php(106): MWExceptionRenderer::output(WMFTimeoutException, integer)
#16 /srv/mediawiki/php-1.36.0-wmf.31/includes/exception/MWExceptionHandler.php(185): MWExceptionHandler::report(WMFTimeoutException)
#17 /srv/mediawiki/php-1.36.0-wmf.31/includes/exception/MWExceptionHandler.php(156): MWExceptionHandler::handleException(WMFTimeoutException, string)
#18 [internal function]: MWExceptionHandler::handleUncaughtException(WMFTimeoutException)
#19 {main}
Impact

Unclear. Not seen this before.

Notes

The warning comes from memcached-pecl internals.

  • The stack trace is lacking an entry point, so I'm guessing this might be shutdown or post-send?
  • Or maybe this is how rest.php errors are always reported? (I've not seen many of those).
  • Also, given it's rest.php, why is this triggering a Skin response?

From a very brief look at internals it seems some internal state is lost and/or skipped.

  • It could be that our code regressed and we're actually no longer calling the constructor when we should, but I'm gonna say that's very unlikely or the warning would be more common.
  • It could be that it's a new bug or race condition introduced by upstream php and memcached-pecl that run on Debian Buster since we recently upgraded.
  • It could be that the WMFTimeoutException might be happening in a destructor call during gargage collection and maybe the state was there but already lost?
  • It could be that somehow WMFTimeoutException disrupted an assumed-atomic section in memcached-pecl and then later on during another getMulti call it assumed the setup was done when it wasn't actually?

Details

Request ID
YDWmMgpAMKQAAAlA0KUAAAAQ
Request URL
/w/rest.php/en.wikipedia.org/v3/page/pagebundle/…/…

Event Timeline

Krinkle triaged this task as High priority.Feb 24 2021, 4:06 AM
Krinkle created this task.

I tried to reproduce this locally with P14574, which throws an exception once per millisecond, and if an exception occurs during client destruction, it reuses the client. But it didn't turn up anything. Maybe in production?

tstarling lowered the priority of this task from High to Low.EditedMar 11 2021, 2:24 AM
tstarling removed a project: PHP 8.0 support.
tstarling added a subscriber: aaron.

@aaron and I reviewed the memcached PECL code, but we did not find anything which would cause this.

From the logs, there seems to have been a single instance of failure, causing a burst of 3102 events from a single server (wtp1046). I don't think this is high priority unless it happens more often.

Behaviour will be an exception with php8 - https://github.com/php-memcached-dev/php-memcached/commit/9a429d43f8a4aab9292287502a3393588e2255c1

An exception would actually be preferable. This is not a problem for PHP 8 support.

I wondered whether it's possible for an Excimer callback to execute after object creation but before the constructor is called. I used the following code:

$timer = new ExcimerTimer; 
$timer->setInterval( 1 ); 
$timer->setCallback(
    function() {
        throw new \Exception('ping');
    }
);
$timer->start();
$memcached = new Memcached;
$memcached = null;

Then I put gdb breakpoints on php_memc_object_new, zim_Memcached___construct and excimer_timer_interrupt. Gdb stopped in php_memc_object_new and I waited for the 1 second timer to expire. The constructor was then successfully called before the interrupt function. The first opportunity for the interrupt function is at the end of the DO_FCALL handler which calls __construct().

Similarly, with a breakpoint in php_memc_object_free_storage, I was able to confirm that there is no opportunity for the interrupt function to run between php_memc_object_free_storage and final destruction of the enclosing zval.

I did these tests on a recent version of PHP git master (8.1-dev, 3a867b982755501b514d19418086972a0cc693e7) and the current memcached PECL git master.

It could be that our code regressed and we're actually no longer calling the constructor when we should, but I'm gonna say that's very unlikely or the warning would be more common.

I think the only way that could happen is by subclassing the Memcached class, but we're not doing that.

It could be that the WMFTimeoutException might be happening in a destructor call during gargage collection and maybe the state was there but already lost?

The convention is that no userspace code runs between the free_obj handler and the destruction of all references. The garbage collector respects this by queuing objects to be deleted, incrementing their reference counts and calling all their __destruct() functions. Then in a second pass, it calls their free_obj handlers. The free_obj handler is not meant to run userspace code, and probably doesn't.

It could be that somehow WMFTimeoutException disrupted an assumed-atomic section in memcached-pecl and then later on during another getMulti call it assumed the setup was done when it wasn't actually?

That's basically the theory I'm testing, but I haven't been able to reproduce it. When you do $memcached = new Memcached, that is implemented as three opcodes:

  1. new
  2. Memcached::__construct()
  3. assignment

If the object is used between 1 and 2, this warning will be seen, but my test with gdb shows the interrupt flag is not being checked at that time. And even if an exception is thrown between 1 and 2, how would subsequent code get the object and call methods on it when the assignment (3) hasn't been done?

Destruction seems even less likely as a cause. It goes something like:

  1. refcount decremented to 0
  2. refcount incremented to 1
  3. __destruct() called
  4. free_obj called
  5. refcount decremented to 0
  6. storage freed

Here the atomic section is between stage 4 and 5, but it's hard to imagine the interrupt flag being checked at that time, because it doesn't re-enter the VM.

Not seen in 90 days. Closing for now.

Given the apparent impossibilty, this might be another opcache victim (ref T245183). It'd be a new kind of end result, but again, maybe not that unlikely. E.g. we've already seen opcache corruption cause method calls to be applied to the wrong object in memory. Ideally that causes a fatal error, but if it succeeded, then it might fail later on with something else. If the constructor call ended up applied to a different class name, one that exists, then perhaps that's what is ocurring here, where later on we find the ctor was never called. Probably not opcache, but I can't rule it out yet either.