Page MenuHomePhabricator

LBFactoryMulti.php: PHP Notice: Undefined index:
Closed, ResolvedPublicPRODUCTION ERROR

Description

Logstash for "LBFactoryMulti.php PHP Notice: Undefined index:": https://logstash.wikimedia.org/goto/6cdfce6eddba2679f98de0dc215cd3f6

Logstash for channel:AdHocDebug AND message:T232613 over 24 hours
https://logstash.wikimedia.org/goto/06093b1efb2db29266818a4f200c8150

Error message
[XXj0RQpAIDsAAI0kXXYAAADG]
/w/api.php?format=json&formatversion=2&errorformat=plaintext&action=query&meta=wikimediaeditortaskscounts
ErrorException from line 297 of /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php:
PHP Notice: Undefined index:
php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php
295     private function makeServerArray( array $serverTemplate, array $groupLoads ) {
296         // The master server is the first host explicitly listed in the generic load group
297         if ( !$groupLoads[ILoadBalancer::GROUP_GENERIC] ) {
298             throw new UnexpectedValueException( "Empty generic load array; no master defined." );
299         }
Impact

There are just a few occurrences here and there.

The first occurence was on 2019-09-11T13:12:40 which is when I have promoted group 1 to 1.34.0-wmf.22 (which promoted high traffic sites wikidatawiki and commonswiki).

Definitely only happens with 1.34.0-wmf.22

Apparently ONLY php7.2

Notes

I believe that is then followed by:

message
[{exception_id}] {exception_url}   UnexpectedValueException from line 298 of /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: Empty generic load array; no master defined.
stacktrace
#0 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(278): Wikimedia\Rdbms\LBFactoryMulti->makeServerArray(array, array)
#1 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(206): Wikimedia\Rdbms\LBFactoryMulti->newLoadBalancer(array, array, boolean, integer)
#2 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(213): Wikimedia\Rdbms\LBFactoryMulti->newExternalLB(string, integer)
#3 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/Utils.php(77): Wikimedia\Rdbms\LBFactoryMulti->getExternalLB(string)
#4 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/Utils.php(61): MediaWiki\Extension\WikimediaEditorTasks\Utils::getDB(integer, MediaWiki\MediaWikiServices, string, string, array)
#5 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/ServiceWiring.php(35): MediaWiki\Extension\WikimediaEditorTasks\Utils::getUserCountsDB(integer, MediaWiki\MediaWikiServices)
#6 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/services/ServiceContainer.php(458): Wikimedia\Services\ServiceContainer->{closure}(MediaWiki\MediaWikiServices)
#7 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/services/ServiceContainer.php(427): Wikimedia\Services\ServiceContainer->createService(string)
#8 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/WikimediaEditorTasksServices.php(72): Wikimedia\Services\ServiceContainer->getService(string)
#9 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/Api/ApiQueryWikimediaEditorTasksCounts.php(52): MediaWiki\Extension\WikimediaEditorTasks\WikimediaEditorTasksServices->getCounterDao()
#10 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/Api/ApiQueryWikimediaEditorTasksCounts.php(35): MediaWiki\Extension\WikimediaEditorTasks\Api\ApiQueryWikimediaEditorTasksCounts->run()
#11 /srv/mediawiki/php-1.34.0-wmf.22/includes/api/ApiQuery.php(255): MediaWiki\Extension\WikimediaEditorTasks\Api\ApiQueryWikimediaEditorTasksCounts->execute()
#12 /srv/mediawiki/php-1.34.0-wmf.22/includes/api/ApiMain.php(1599): ApiQuery->execute()
#13 /srv/mediawiki/php-1.34.0-wmf.22/includes/api/ApiMain.php(537): ApiMain->executeAction()
#14 /srv/mediawiki/php-1.34.0-wmf.22/includes/api/ApiMain.php(508): ApiMain->executeActionWithErrorHandling()
#15 /srv/mediawiki/php-1.34.0-wmf.22/api.php(87): ApiMain->execute()
#16 /srv/mediawiki/w/api.php(3): require(string)
#17 {main}

Details

Request ID
XXj0RQpAIDsAAI0kXXYAAADG
Request URL
https://www.wikidata.org/w/api.php?format=json&formatversion=2&errorformat=plaintext&action=query&meta=wikimediaeditortaskscounts
Stack Trace
#0 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(297): MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(278): Wikimedia\Rdbms\LBFactoryMulti->makeServerArray(array, array)
#2 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(206): Wikimedia\Rdbms\LBFactoryMulti->newLoadBalancer(array, array, boolean, integer)
#3 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(213): Wikimedia\Rdbms\LBFactoryMulti->newExternalLB(string, integer)
#4 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/Utils.php(77): Wikimedia\Rdbms\LBFactoryMulti->getExternalLB(string)
#5 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/Utils.php(61): MediaWiki\Extension\WikimediaEditorTasks\Utils::getDB(integer, MediaWiki\MediaWikiServices, string, string, array)
#6 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/ServiceWiring.php(35): MediaWiki\Extension\WikimediaEditorTasks\Utils::getUserCountsDB(integer, MediaWiki\MediaWikiServices)
#7 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/services/ServiceContainer.php(458): Wikimedia\Services\ServiceContainer->{closure}(MediaWiki\MediaWikiServices)
#8 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/services/ServiceContainer.php(427): Wikimedia\Services\ServiceContainer->createService(string)
#9 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/WikimediaEditorTasksServices.php(72): Wikimedia\Services\ServiceContainer->getService(string)
#10 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/Api/ApiQueryWikimediaEditorTasksCounts.php(52): MediaWiki\Extension\WikimediaEditorTasks\WikimediaEditorTasksServices->getCounterDao()
#11 /srv/mediawiki/php-1.34.0-wmf.22/extensions/WikimediaEditorTasks/src/Api/ApiQueryWikimediaEditorTasksCounts.php(35): MediaWiki\Extension\WikimediaEditorTasks\Api\ApiQueryWikimediaEditorTasksCounts->run()
#12 /srv/mediawiki/php-1.34.0-wmf.22/includes/api/ApiQuery.php(255): MediaWiki\Extension\WikimediaEditorTasks\Api\ApiQueryWikimediaEditorTasksCounts->execute()
#13 /srv/mediawiki/php-1.34.0-wmf.22/includes/api/ApiMain.php(1599): ApiQuery->execute()
#14 /srv/mediawiki/php-1.34.0-wmf.22/includes/api/ApiMain.php(537): ApiMain->executeAction()
#15 /srv/mediawiki/php-1.34.0-wmf.22/includes/api/ApiMain.php(508): ApiMain->executeActionWithErrorHandling()
#16 /srv/mediawiki/php-1.34.0-wmf.22/api.php(87): ApiMain->execute()
#17 /srv/mediawiki/w/api.php(3): require(string)
#18 {main}

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I looked at it, but I can't see how the error is occurring.

#0 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(297): MWExceptionHandler::handleError(integer, string, string, integer, array)

What happens at LBFactoryMulti.php line 297 is a PHP notice that an array doesn't have the empty-string as an index. Line 297 is

if ( !$groupLoads[ILoadBalancer::GROUP_GENERIC] ) {

and ILoadBalancer::GROUP_GENERIC is the empty string, so the error coming from that line is reasonable depending on $groupLoads.

So where does $groupLoads come from?

private function makeServerArray( array $serverTemplate, array $groupLoads ) {
    // The master server is the first host explicitly listed in the generic load group
    if ( !$groupLoads[ILoadBalancer::GROUP_GENERIC] ) {

Ok, coming from the caller.

#1 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(278): Wikimedia\Rdbms\LBFactoryMulti->makeServerArray(array, array)

private function newLoadBalancer( $serverTemplate, $groupLoads, $readOnlyReason, $owner ) {
    $lb = new LoadBalancer( array_merge(
        $this->baseLoadBalancerParams( $owner ),
        [
            'servers' => $this->makeServerArray( $serverTemplate, $groupLoads ),

Coming from the caller again.

#2 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(206): Wikimedia\Rdbms\LBFactoryMulti->newLoadBalancer(array, array, boolean, integer)

public function newExternalLB( $cluster, $owner = null ) {
    if ( !isset( $this->externalLoads[$cluster] ) ) {
        throw new InvalidArgumentException( "Unknown cluster '$cluster'" );
    }

    return $this->newLoadBalancer(
        array_merge(
            $this->serverTemplate,
            $this->externalTemplateOverrides,
            $this->templateOverridesByCluster[$cluster] ?? []
        ),
        [ ILoadBalancer::GROUP_GENERIC => $this->externalLoads[$cluster] ],
        $this->readOnlyReason,
        $owner
    );

It's constructing an array right there that should always have that index set to something. Even if the value turns out to be something unexpected, the index will exist. WTF?

@Anomie thank you for the detailed reverse lookup, I went with the same conclusion and went tagging PHP 7.2 support .
It happens from time to time, its solely on php7.2 although mw1347 has twice more reports.

Thank you Timo for the extra bug reports! I like Tim trick to cause a coredump so that the issue can eventually be debugged:

This would probably be enough to make it successfully do a core dump in production:

posix_setrlimit( POSIX_RLIMIT_CORE, 10e9, POSIX_RLIMIT_INFINITY );
posix_kill( posix_getpid(), SIGABRT );

;)

@Agusbou2015: Please refrain from adding comments which are not helpful.

mw1347 and mw1348 receive more traffic than the rest of the php api servers, so it makes sense this happens more frequently there.

If we add code that would make us exit with SIGABRT, I'd like to guard it with a config variable and only activate it on a couple hosts at first.

Also: in order to generate core dumps, we will have to change the php-fpm configuration. So anything in that direction needs SRE support.

jijiki renamed this task from LBFactoryMulti.php PHP Notice: Undefined index: to LBFactoryMulti.php: PHP Notice: Undefined index: .Sep 12 2019, 2:26 PM

FWIW, I did some very simple tests locally. Basically, I have a loop which enters the faulty path a few times and checks whether the key exists. I did it on both PHP72 and 73; I've enabled opcache; I've made sure that the GC is enabled, and made it kick in with gc_collect_cycles. I haven't been able to reproduce this bug even a single time. Which is not surprising, but maybe it should suggest that it's really not that simple (which, again, is not surprising, is it?).

Change 536222 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/core@wmf/1.34.0-wmf.22] Core dump on empty string array key that should exist

https://gerrit.wikimedia.org/r/536222

Change 536243 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Enable coredump on some mysterious php7.2 failure

https://gerrit.wikimedia.org/r/536243

Change 536222 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.22] Core dump on empty string array key that should exist

https://gerrit.wikimedia.org/r/536222

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:31:53Z] <jforrester@deploy1001> Synchronized php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: T232613 Add ability to core dump on empty string array key that should exist (wmf.22 only, flagged off) (duration: 01m 03s)

Change 536243 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable coredump on some mysterious php7.2 failure

https://gerrit.wikimedia.org/r/536243

Mentioned in SAL (#wikimedia-operations) [2019-09-12T18:59:23Z] <hashar@deploy1001> Synchronized wmf-config/CommonSettings.php: Enable coredump on some mysterious php7.2 failure - T232613 (duration: 01m 04s)

Hunting for a potential coredump file, in logstash I end up seeing:

ErrorException from line 304 of /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php:
PHP Warning: Use of undefined constant SIGABRT - assumed 'SIGABRT' (this will throw an Error in a future
includes/libs/rdbms/lbfactory/LBFactoryMulti.php
295     private function makeServerArray( array $serverTemplate, array $groupLoads ) {
296         global $wgT232613;
297 
298         // The master server is the first host explicitly listed in the generic load group
299         if ( !$groupLoads[ILoadBalancer::GROUP_GENERIC] ) {
300             if ( !array_key_exists( ILoadBalancer::GROUP_GENERIC, $groupLoads )
301                 && isset( $wgT232613 ) && $wgT232613 === true ) {
302                 // Core dump
303                 posix_setrlimit( POSIX_RLIMIT_CORE, (int)10e9, POSIX_RLIMIT_INFINITY );
304                 posix_kill( posix_getpid(), SIGABRT );
305             }
306             throw new UnexpectedValueException( "Empty generic load array; no master defined." );
307         }

The SIGABRT constant is defined by the pcntl extension which supposedly is compiled in / always enabled?

mw1348:~$ php -c /etc/php/7.2/fpm -r 'var_dump( SIGABRT );'
int(6)

On mwdebug1001.eqiad.wmnet:

/srv/mediawiki/w/pcntl.php
<pre><?php
var_dump( PHP_VERSION );
var_dump( extension_loaded( 'pcntl' ) );
var_dump( SIGABRT );
hhvm has pcntl
string(11) "5.6.99-hhvm"
bool(true)
int(6)
php7.2-fpm lacks pcntl
string(47) "7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1"
bool(false)
string(7) "SIGABRT"

I'm sure SRE will have a good solution about loading pcntl etc. But if we need a coredump now, you could also just replace SIGABRT with 6...

Change 536338 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/core@wmf/1.34.0-wmf.22] Hardcode posix signal and log coredump

https://gerrit.wikimedia.org/r/536338

Change 536338 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.22] Hardcode posix signal and log coredump

https://gerrit.wikimedia.org/r/536338

Mentioned in SAL (#wikimedia-operations) [2019-09-12T21:18:24Z] <hashar@deploy1001> Synchronized php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: Hardcode posix signal and log coredump - T232613 (duration: 01m 04s)

Logstash for channel:AdHocDebug AND message:T232613 over 24 hours
https://logstash.wikimedia.org/goto/06093b1efb2db29266818a4f200c8150

Luckily it got a coredump immediately https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.09.12/mediawiki?id=AW0nWSIAZKA7RpirOT7G&_g=h@1251ff0

On mw1227.eqiad.wmnet there should thus be a php coredump under /var/tmp/core/.

Just to clarify: I do not know how to debug that issue using gdb and the core file :-\

It didn't generate a core dump. In my defence I did only say that it would "probably be enough". It doesn't work because PR_SET_DUMPABLE is used to disable core dumps. To enable core dumps, the PHP FPM pool option process.dumpable needs to be set to "yes". This can safely be set to "yes" globally since rlimit_core still defaults to zero, so it doesn't dump core on segfault unless the limit has been manually raised. You can see the current configuration with php-fpm7.2 -tt, and I'm also going to upload a test script.

Change 536398 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Add coredump action to fatal-error.php

https://gerrit.wikimedia.org/r/536398

Change 536400 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/puppet@production] In PHP FPM, enable process.dumpable everywhere

https://gerrit.wikimedia.org/r/536400

Change 536400 merged by Giuseppe Lavagetto:
[operations/puppet@production] In PHP FPM, enable process.dumpable everywhere

https://gerrit.wikimedia.org/r/536400

Change 536398 merged by jenkins-bot:
[operations/mediawiki-config@master] Add coredump action to fatal-error.php

https://gerrit.wikimedia.org/r/536398

We have a first core dump on mw1348 - I moved it under /root/T232613, if you need access ping me on IRC.

Change 536580 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/core@wmf/1.34.0-wmf.22] Add more log and context for T232613 logging

https://gerrit.wikimedia.org/r/536580

Change 536582 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] profile::mediawiki::api: Setup systemd-coredump on api servers

https://gerrit.wikimedia.org/r/536582

I'll dump my notes here, sorry about it being long and boring. Summary: no answers yet.

The array does appear to have an element with an empty string key:

(gdb) source ~tstarling/src/php7.2-7.2.22/.gdbinit
...
(gdb) print executor_globals.current_execute_data.prev_execute_data
$15 = (zend_execute_data *) 0x7f2d85a1dd90
(gdb) print_cvs $15
...
[1] 'groupLoads'
[0x7f2d85a1ddf0] (refcount=2) array:     Hash(1)[0x7f2c2fb15a48]: {
      [0]  => [0x7f2c2d9dc8a0] (refcount=2) array: 
}
...
(gdb) print_zstr ((zend_array*)0x7f2c2fb15a48).arData[0].key
string(0) ""

The value with that key

Hash(3)[0x7f2c46114a70]: {
  [0] 10.64.32.11 => [0x7f2c46114ac8] long: 0
  [1] 10.64.0.97 => [0x7f2c46114ae8] long: 1
  [2] 10.64.48.111 => [0x7f2c46114b08] long: 1
}

It looks pretty normal. So you have to wonder what ILoadBalancer::GROUP_GENERIC is at this point. The opcodes from the line of the notice emission:

ZEND_FETCH_CLASS_CONSTANT 32 16 => 224
ZEND_FETCH_DIM_R  96 224 => 208
ZEND_JMPNZ 208 1024 => 0

The handler of the first of these is ZEND_FETCH_CLASS_CONSTANT_SPEC_CONST_CONST_HANDLER, which has:

if (EXPECTED(CACHED_PTR(Z_CACHE_SLOT_P(EX_CONSTANT(opline->op2))))) {
	value = CACHED_PTR(Z_CACHE_SLOT_P(EX_CONSTANT(opline->op2)));

So the value of the constant is stored in a cache. Let's figure out where that is:

(gdb) print ((zval*)(((char*)($ex.literals)) + 16))
$51 = (zval *) 0x7f2c466de5d8
(gdb) printzv $51
[0x7f2c466de5d8] (refcount=1) string: GROUP_GENERIC
(gdb) print $51.u2.cache_slot
$52 = 8
(gdb) print ((void**)((char*)$ex.run_time_cache + (8)))[0]
$53 = (void *) 0x7f2d85a94638
(gdb) printzv $53
[0x7f2d85a94638] (refcount=1) string: 
(gdb) print ((zval*)$53).value.str[0]
$56 = {gc = {refcount = 1, u = {v = {type = 6 '\006', flags = 6 '\006', gc_info = 0}, type_info = 1542}}, h = 9223372036854781189, 
  len = 0, val = ""}

The cache of ILoadBalancer::GROUP_GENERIC is a perfectly ordinary empty string, in fact it is the same zend_string* pointer value (0x7f2c3fcffdf0) as the one used in the Bucket of the array. So I have confirmed that $groupLoads has an element with an empty string key, and ILoadBalancer::GROUP_GENERIC is also an empty string.

Time for a break, but there's still plenty of things to check, for example, I want to look at variable 208 which was the input to JMPNZ.

Change 536580 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.22] Add more log and context for T232613 logging

https://gerrit.wikimedia.org/r/536580

Mentioned in SAL (#wikimedia-operations) [2019-09-13T15:02:52Z] <hashar@deploy1001> Synchronized php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: Add more log and context for T232613 logging - T232613 (duration: 01m 04s)

Mentioned in SAL (#wikimedia-operations) [2019-09-13T15:02:52Z] <hashar@deploy1001> Synchronized php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: Add more log and context for T232613 logging - T232613 (duration: 01m 04s)

And I got some enhanced log from a core dump on mw1226.eqiad.wmnet at 2019-09-13T15:02:57

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.09.13/mediawiki?id=AW0rJhC6CfBFUvHR9WZA&_g=h@44136fa

ContextPhp codeValue
T232613_GROUP_GENERIC(string)( '' === ILoadBalancer::GROUP_GENERIC )''
T232613_equality(string)( '' === ILoadBalancer::GROUP_GENERIC )1

T232613_groupLoads

 array (
...
  '' => 
  array (
    'db1109' => 0,
    'db1104' => 300,
    'db1099:3318' => 100,
    'db1092' => 200,
    'db1101:3318' => 100,
    'db1087' => 1,
    'db1126' => 500,
  ),
)

The message got logged because !$groupLoads[ILoadBalancer::GROUP_GENERIC] and !array_key_exists( ILoadBalancer::GROUP_GENERIC, $groupLoads ) both evaluated to true. So whatever ILoadBalancer::GROUP_GENERIC is is sometime not found in the array by the low level Zend instruction? :-\

Twist, I could not find the debug message from the other function (newExternalLB) :-\ And looking at the code, that specific issue came from a different caller Wikimedia\Rdbms\LBFactoryMulti->newMainLB. So will need some logging there?

$dbGroupLoads = $this->groupLoadsByDB[$this->getDomainDatabase( $domain )] ?? [];
unset( $dbGroupLoads[ILoadBalancer::GROUP_GENERIC] ); // cannot override

And then:

array_merge( $this->groupLoadsBySection[$section], $dbGroupLoads ),

To be investigated.

Change 536622 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Disable adhoc core dump logging

https://gerrit.wikimedia.org/r/536622

Change 536622 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable adhoc core dump logging

https://gerrit.wikimedia.org/r/536622

Mentioned in SAL (#wikimedia-operations) [2019-09-13T15:34:40Z] <hashar@deploy1001> Synchronized wmf-config/CommonSettings.php: Disable adhoc core dump logging - T232613 (duration: 01m 04s)

I have disabled the adhoc logging and the core dumps. I guess we have enough core by now.

Some other code path would have needed the adhoc debug logging to compare the state of variables.

Continuing the core dump analysis:

  • Variable 224 is the empty string. More confirmation that ILoadBalancer::GROUP_GENERIC was correct.
  • Variable 208 was assigned several more times before the core dump point so was not preserved.

So I had a closer look at what ZEND_FETCH_DIM_R would have done.

	h = zend_string_hash_val(key);
	arData = ht->arData;
	nIndex = h | ht->nTableMask;
	idx = HT_HASH_EX(arData, nIndex);
	while (EXPECTED(idx != HT_INVALID_IDX)) {
  • h is 9223372036854781189
  • ht->nTableMask is 0xfffffff8 (i.e. -8, the default)
  • nIndex would have been 4294967293 (i.e. -3)
  • idx would have thus been 4294967295, which is HT_INVALID_IDX

Finally something abnormal! Here's the contents of the index table:

(gdb) set $ht = (zend_array *) 0x7f2c2fb15a48
(gdb) print *(((uint32_t*)$ht.arData)-8)@8
$40 = {0, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}

So only -8 is valid.

Negative indexes into arData contain a hashtable of 32-bit integers pointing into the Buckets stored as positive indexes into arData. For some reason, this array has a hashtable of the default size of 8, where -8 is the only valid masked hash value, but its only valid Bucket contains an empty string key with a masked hash value of -3. So if you iterate over this array, you see the empty string key, but if you try to fetch something from it by key, the value is not found.

This should be tried as a workaround: opcache.interned_strings_buffer = 0 in php.ini.

When the element is added to $groupLoads, the hash of the string is copied into the Bucket:

	if (!ZSTR_IS_INTERNED(key)) {
		zend_string_addref(key);
		ht->u.flags &= ~HASH_FLAG_STATIC_KEYS;
		zend_string_hash_val(key);
	}
	p->h = h = ZSTR_H(key);

But p->h of the Bucket in question is zero, which is consistent with the table index of -8 being used instead of -3. The preceding lines show that the hash is only initialised if the string is not interned, so for interned strings, the hash value is presumed to be correct. By the time the core dump occurs, it apparently is correct, but when the array was constructed, it was wrong.

opcache has a system for saving and restoring the array of interned strings, which I think may be implicated. In the core dump, the empty string stored by zend_empty_string is not present in interned_strings_permanent, which is the HashTable meant to track interned strings, which opcache saves and restores. Or at least, that string is not in accessible via the relevant negative index into arData. There is an empty string in arData[0] but it has a different pointer value and is not being used.

It's worrying that zend_empty_string is not in interned_strings_permanent, since tracing a non-opcache CLI request shows that being added to that hashtable is how it normally gets its hash value initialised.

Setting opcache.interned_strings_buffer = 0 will disable saving and restoring of interned_strings_permanent. If the errors stop, that will isolate the bug to opcache's handling of this hashtable.

Changing ILoadBalancer::GROUP_GENERIC to a string of length greater than 1 might also work as a permanent workaround. It could be say '[generic]'. But the other bugs would remain unfixed unless they also changed their array keys.

Mentioned in SAL (#wikimedia-operations) [2019-09-15T14:35:00Z] <_joe_> test: setting opcache.interned_strings_buffer to 0 on mw1348 for T232613

Mentioned in SAL (#wikimedia-operations) [2019-09-15T14:35:00Z] <_joe_> test: setting opcache.interned_strings_buffer to 0 on mw1348 for T232613

Apparently it's working. Cannot post a link to logstash because I'm from mobile and it's already painful enough to use logstash, plus it won't let me copy the share link. Anyway, no more errors since 14:30 UTC.

As can be seen on logstash, as @Daimona mentioned, errors suddenly stopped after I disabled the interned strings buffer. While it's early to evaluate fully any impact on performance, I would say that the effect is clearly not very penalizing. See the 95th percentile for successful responses for mw1348.

As a side note - I did have to live-hack the prometheus metrics collector to stop collecting the interned strings metrics - we had a hole in stats because of that in the first minutes after I disabled the interned strings buffer.

Tomorrow @jijiki and I will propagate the setting everywhere.

The theory that zend_empty_string fails to have its hash value initialised when opcache is enabled was not correct. It is already initialised before php-fpm forks. So the new theory is that it is initialised correctly but something sets it back to zero.

I attached gdb to a random worker in production, and set a hardware watchpoint on zend_empty_string.h. This watchpoint was eventually triggered by a call to zend_string_forget_hash_val() from the memcached extension. This is a bug in the memcached extension that was fixed in this commit: https://github.com/php-memcached-dev/php-memcached/commit/5f28025c15c87d0895f39def77068c8fd66d442a . Sending an empty string to memcached, say using set(), causes the hash value of the empty string to be reset.

This gives us the following reproduction procedure, which works in eval.php:

$memcached = new Memcached;
$memcached->addServer('127.0.0.1:11212');
// Set zend_empty_string.h to zero
$memcached->set('test', '');
// Make an array with a broken hashtable
$empty = '';
$a = [$empty => 1];
// Accessing $b[''] re-initialises zend_empty_string.h
$b = [];
print isset($b['']);
var_dump($a['']); // prints NULL since zend_empty_string.h is correct but it's wrong in $a

But it's now a mystery as to why opcache.interned_strings_buffer = 0 helps, since this works without opcache.

The commit Tim mentions is to solve https://github.com/php-memcached-dev/php-memcached/issues/338 (which has another test case).

We have php-memcached 3.0.1, I guess we could either cherry pick the commit or upgrade to 3.0.4 ? ( changelog)

Change 536582 abandoned by Effie Mouzeli:
profile::mediawiki::api: Setup systemd-coredump on api servers

Reason:
In favour of Ibd41534b78cfdbda0086ce9e1352456d57fb1722 and I38ab60905ab2e1f02d293b70fc8fb8b920adbff9

https://gerrit.wikimedia.org/r/536582

Mentioned in SAL (#wikimedia-operations) [2019-09-16T10:29:06Z] <_joe_> installing a patched php-memcached on mw1347 T232613

But it's now a mystery as to why opcache.interned_strings_buffer = 0 helps, since this works without opcache.

One hypothesis we advanced on IRC is that opcache persisting interned strings cross-process is necessary to trigger the bug in production.

Anyways, I built a patched memcached client and confirmed I can't reproduce anymore:

# Not patched
mw1342:~$ php repro.php 
Notice: Undefined index:  in /home/oblivian/repro.php on line 12
NULL

# patched
mw1347:~$ php repro.php 
int(1)

I'll wait a bit to see how mw1347 behaves, and apply the fix cluster-wide.

I see no further occurrences of the bug in logstash either for mw1347 - not that I had many doubts at this point.

I'll upload the package and upgrade the cluster.

Mentioned in SAL (#wikimedia-operations) [2019-09-16T11:06:33Z] <_joe_> uploaded php-memcached_3.0.1+2.2.0-1~wmf3 to component/php72 for stretch T232613

Mentioned in SAL (#wikimedia-operations) [2019-09-16T11:50:36Z] <_joe_> rolling restart of php-fpm in codfw to pick up the new memcached extension T232613

Mentioned in SAL (#wikimedia-operations) [2019-09-16T11:56:25Z] <_joe_> rolling restart of php-fpm in eqiad to pick up the new memcached extension T232613

Mentioned in SAL (#wikimedia-operations) [2019-09-16T12:07:19Z] <_joe_> rolling restart ended on eqiad T232613

hashar closed this task as Resolved.EditedSep 16 2019, 12:51 PM

The error is gone, so tentatively fixed :]

Thank you very much @tstarling and @Joe for the debugging / tests etc! That strikes three bugs in one shot.

wow. nice sleuthing !!!

Seriously! Well done everyone

DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks

Change 558769 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] CommonSettings.php: Remove the disabled "temporary" code for T232613

https://gerrit.wikimedia.org/r/558769

Change 558769 merged by jenkins-bot:
[operations/mediawiki-config@master] CommonSettings.php: Remove the disabled "temporary" code for T232613

https://gerrit.wikimedia.org/r/558769