Page MenuHomePhabricator

Stack overflow in WikibaseRepo initialization on Wikidata Beta
Closed, ResolvedPublic

Description

There seems to be some call cycle in the WikibaseRepo initialization on wikidata.beta.wmflabs.org. From logstash:

"0":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php","line":927},
"1":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php","line":773,"function":"getStore","class":"Wikibase\\Repo\\WikibaseRepo","type":"->","args":[]},
"2":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php","line":408,"function":"getEntityLookup","class":"Wikibase\\Repo\\WikibaseRepo","type":"->","args":[]},
"3":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php","line":380,"function":"newValidatorBuilders","class":"Wikibase\\Repo\\WikibaseRepo","type":"->","args":[]},
"4":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/WikibaseRepo.datatypes.php","line":328,"function":"getDefaultValidatorBuilders","class":"Wikibase\\Repo\\WikibaseRepo","type":"::","args":[]},
"5":{"function":"Closure$#37","args":[]},
"6":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php","line":624,"function":"call_user_func","args":["Closure$#37;2699"]},
"7":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php","line":615,"function":"getEntityDiffer","class":"Wikibase\\Repo\\WikibaseRepo","type":"->","args":[]},
"8":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php","line":934,"function":"getEntityChangeFactory","class":"Wikibase\\Repo\\WikibaseRepo","type":"->","args":[]},
"9":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php","line":773,"function":"getStore","class":"Wikibase\\Repo\\WikibaseRepo","type":"->","args":[]},
"10":{"file":"/srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php","line":408,"function":"getEntityLookup","class":"Wikibase\\Repo\\WikibaseRepo","type":"->","args":[]},

Earliest log message for this error appears to be from today (2018-03-05) 14:37 UTC. There haven’t been any recent Wikibase commits since last Wednesday except some comment or i18n updates, so I’m not sure what could have caused this.

Event Timeline

Addshore triaged this task as High priority.Mar 5 2018, 8:02 PM
Addshore added a subscriber: Addshore.

Marked as high so that we investigate this, in theory the cause would be in this weeks train.

I see the following in the logs after hitting https://wikidata.beta.wmflabs.org/wiki/Wikidata:Main_Page a few times

Fatal error: Stack overflow in /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php on line 622
[Mon Mar  5 20:05:17 2018] [hphp] [20682:7fa8713ff700:22210:000001] [] \nFatal error: Stack overflow in /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/WikibaseRepo.php on line 622

[Mon Mar  5 20:05:02 2018] [hphp] [20682:7fa7d2fff700:22063:000001] [] \nCatchable fatal error: Argument 1 passed to __invoke() must be an instance of ApiMain, undefined variable given in /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/Content
Catchable fatal error: Argument 1 passed to __invoke() must be an instance of ApiMain, undefined variable given in /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/Content/EntityContentFactory.php on line 204

The first instance of that “must be an instance of ApiMain” error in EntityContentFactory is at 13:46 UTC, so a bit before the first stack overflow. No idea if they’re related.

#0 /srv/mediawiki/php-master/extensions/Wikibase/repo/Wikibase.php(250): MWExceptionHandler::handleError(integer, string, string, integer, array, array)
#1 [internal function]: Closure$#5()
#2 /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/Content/EntityContentFactory.php(204): call_user_func(Closure$#5;2604)
#3 /srv/mediawiki/php-master/extensions/Wikibase/repo/RepoHooks.php(116): Wikibase\Repo\Content\EntityContentFactory->getContentHandlerForType(string)

I’m trying to understand what’s happening here, and it looks to me like $wgApiModules['wbsearchentities']['factory'] (Wikibase.php L250) was somehow registered as a content handler factory‽ So the EntityContentFactory tries to call that factory to get a content handler (with no arguments, because those factories aren’t supposed to take arguments) and it crashes because the wbsearchentities factory (a completely unrelated kind of factory!) expects an ApiMain. But how does that happen? I’m bewildered.

it looks to me like $wgApiModules['wbsearchentities']['factory'] (Wikibase.php L250) was somehow registered as a content handler factory‽

Yup, the wbsearchentities factory for one entity type and the wbremoveclaims for another one. No idea how. Investigation continues…

So I temporarily added a few debug statements on deployment-mediawiki04.deployment-prep.eqiad.wmflabs (only long enough for me to make one web request, then removed again immediately – @Ladsgroup said this was acceptable if the bug couldn’t be reproduced elsewhere), and it seems that adding this right after the wbsearchentities definition in Wikibase.php (L251)

file_put_contents( '/tmp/wbsearchentities-factory', var_export( $wgAPIModules['wbsearchentities']['factory'], true ) );

and this after WikibaseRepo.entitytypes.php is loaded in WikibaseRepo::getDefaultEntityTypes (L560)

file_put_contents( '/tmp/item-content-handler-factory-callback', var_export( $repoEntityTypes['item']['content-handler-factory-callback'], true ) );

makes the wiki work again. Somehow. (It seems to be reliable – I tried adding and removing them about ten times, and each time the wiki went from broken to working back to broken.)

The reason I added those debug statements was that I suspected the interpreter was mixing up closures, and I wanted to see if these two were dumped with the same ID (e. g. Closure$#5;2604). But it looks like those statements are actually enough to fix the bug, somehow. If my suspicion is correct (which could also explain the stack overflow in the initialization – there’s a closure in that stack trace too), I suppose it’s possible that printing a closure is enough to stop the interpreter from mixing them up, perhaps because it causes some kind of deoptimization or something… does this sound at all plausible?

Counterpoint: as far as I can tell from /var/log/dpkg.log*, HHVM hasn’t been upgraded on that host since February 8th, so if this really is an interpreter bug you’d think it should have started to happen sooner. So this theory definitely doesn’t provide a complete explanation yet as to how this problem came to be.

This is starting to sound like some sort of odd beta issue, it feels like something is cached somewhere?
I also wonder if the version of hvvm is the same on beta on production.

I guess we can wait and see if this hurts testwikidata in group0 deployment tonight.

I also wonder if the version of hvvm is the same on beta on production.

Good question. Special:Version reports the same string (3.18.6-dev (srv)) on mediawiki.org and en.wikipedia.beta.wmlabs.org. dpkg -s hhvm on deployment-mediawiki04.… reports 3.18.5+dfsg-1+wmf5+icu57, which has a different minor version number than what’s reported on Special:Version (by the same system, according to the server header). I don’t have production shell access, so I can’t check which dpkg version is installed there.

This is starting to sound like some sort of odd beta issue, it feels like something is cached somewhere?
I also wonder if the version of hvvm is the same on beta on production.

I guess we can wait and see if this hurts testwikidata in group0 deployment tonight.

I'm starting branch cut for this Shortly™, FYI.

Also the version of hhvm in beta and production do seem different: 3.18.5+dfsg-1+wmf5+icu57 on beta vs 3.18.5+dfsg-1+wmf4 in prod.

Would you be willing to proceed with group0 without knowing whether this bug will occur there or not, as @Addshore suggested, or does this already block the train?

Well, test.wikidata.org seems to be on .24 now and no breakage to be seen so far…

Addshore lowered the priority of this task from High to Medium.Mar 6 2018, 10:17 PM
Addshore added a subscriber: MoritzMuehlenhoff.

So i poked around a little bit and did the following on mediawiki04 while reading https://wikitech.wikimedia.org/wiki/HHVM/Troubleshooting between 21:48 and 21:53 UTC:

which hhvm-needs-restart
cat /usr/local/bin/hhvm-needs-restart
hhvmadm check-health
hhvm-dump-debug
links -dump http://127.0.0.1/server-status -width 80
curl localhost:9002/dump-apc
# The apc dump is visible @ /home/addshore/apc_dump
cd /var/log/hhvm
ls
sudo tail -f error.log
sudo tail -f *

And once I had done all of that https://wikidata.beta.wmflabs.org/wiki/Wikidata:Main_Page started responding.

During that time it looks like hhvm crashed and restarted https://logstash-beta.wmflabs.org/goto/a476120d0c2416111790af1e7e907e52
Apparently there is a stacktrace @ /var/log/hhvm/stacktrace.20682.log

Mar  6 21:51:37 deployment-mediawiki04 hhvm[20682]: Assertion failure: /srv/hhvm-3.18.5+dfsg/hphp/runtime/base/program-functions.cpp:2331: void HPHP::hphp_memory_cleanup(): assertion `g_context.isNull() || !g_context->hasRequestEventHandlers()' failed.
Mar  6 21:51:37 deployment-mediawiki04 hhvm: Core dumped: Aborted
Mar  6 21:51:37 deployment-mediawiki04 hhvm: Stack trace in /var/log/hhvm/stacktrace.20682.log
Mar  6 21:51:37 deployment-mediawiki04 hhvm[20682]: [Tue Mar  6 21:51:37 2018] [hphp] [20682:7fa83bbff700:0:000001] [7fa8b0df8890:7fa8ab25f067:7fa8ab260448:d53718:21362a6:11724ad:1172b99:d4f2ef:d4f4bb:f31306:7fa8b0df1064:7fa8ab31262d] Core dumped: Aborted
Mar  6 21:51:37 deployment-mediawiki04 hhvm[20682]: [Tue Mar  6 21:51:37 2018] [hphp] [20682:7fa83bbff700:0:000002] [7fa8b0df8890:7fa8ab25f067:7fa8ab260448:d53718:21362a6:11724ad:1172b99:d4f2ef:d4f4bb:f31306:7fa8b0df1064:7fa8ab31262d] Stack trace in /var/log/hhvm/stacktrace.20682.log

Looking back at the commands I executed in .bash_history with timestamps one lines up perfectly with the crash:

#1520373097
curl localhost:9002/dump-apc

https://www.epochconverter.com/timezones?q=1520373097 = Mar 06 2018 21:51:37

I'm not sure if @MoritzMuehlenhoff might be able to speculate a bit about what might have happened here?

Setting to normal now as this doesn't seem so important, and also removing the train blocker.

Addshore lowered the priority of this task from Medium to Low.

Switching to low now, the train seemed to go fine and test.wikidata.org is fine.

Assigning moritz for now just to see if there are any comments re T188924#4029245

After that we can just close this as some sort of random bug / faiulure

I think we can close this by now, it’s unlikely anything new will happen here. (But feel free to reopen if you disagree.)