Page MenuHomePhabricator

HHVM: Unknown exception
Closed, ResolvedPublic

Description

From /var/log/hhvm/error.log, found this on mw1280 (API server) when looking for an unrelated change:

Aug 21 11:16:53 mw1280 hhvm: #012Fatal error: unknown exception
Aug 21 11:16:53 mw1280 hhvm[21493]: [Mon Aug 21 11:16:53 2017] [hphp] [21493:7f2ff6fff700:6208:000001] [] \nFatal error: unknown exception

Come in spikes, but generally occurs more often in July:

for i in `ls error.log-2017*`; do echo -n $i ":"  ; grep "Fatal error: unknown exception" $i | wc -l  ; done

error.log-20170523 :2
error.log-20170524 :2
error.log-20170525 :10
error.log-20170526 :2
error.log-20170527 :6
error.log-20170528 :2
error.log-20170529 :2
error.log-20170530 :2
error.log-20170531 :12
error.log-20170601 :6
error.log-20170602 :0
error.log-20170603 :14
error.log-20170604 :4
error.log-20170605 :6
error.log-20170606 :2
error.log-20170607 :0
error.log-20170608 :0
error.log-20170609 :4
error.log-20170610 :4
error.log-20170611 :0
error.log-20170612 :0
error.log-20170613 :2
error.log-20170614 :76
error.log-20170615 :8
error.log-20170616 :4
error.log-20170617 :2
error.log-20170618 :0
error.log-20170619 :6
error.log-20170620 :8
error.log-20170621 :6
error.log-20170622 :12
error.log-20170623 :4
error.log-20170624 :12
error.log-20170625 :0
error.log-20170626 :12
error.log-20170627 :8
error.log-20170628 :8
error.log-20170629 :8
error.log-20170630 :24
error.log-20170701 :14
error.log-20170702 :8
error.log-20170703 :24
error.log-20170704 :18
error.log-20170705 :18
error.log-20170706 :76
error.log-20170707 :126
error.log-20170708 :192
error.log-20170709 :182
error.log-20170710 :108
error.log-20170711 :104
error.log-20170712 :104
error.log-20170713 :210
error.log-20170714 :88
error.log-20170715 :194
error.log-20170716 :218
error.log-20170717 :208
error.log-20170718 :114
error.log-20170719 :186
error.log-20170720 :178
error.log-20170721 :102
error.log-20170722 :162
error.log-20170723 :286
error.log-20170724 :164
error.log-20170725 :188
error.log-20170726 :186
error.log-20170727 :264
error.log-20170728 :200
error.log-20170729 :320
error.log-20170730 :40
error.log-20170731 :108
error.log-20170801 :98
error.log-20170802 :120
error.log-20170803 :90
error.log-20170804 :68
error.log-20170805 :42
error.log-20170806 :46
error.log-20170807 :28
error.log-20170808 :30
error.log-20170809 :60
error.log-20170810 :42
error.log-20170811 :44
error.log-20170812 :42
error.log-20170813 :66
error.log-20170814 :44
error.log-20170815 :120
error.log-20170816 :180
error.log-20170817 :228
error.log-20170818 :242
error.log-20170819 :280
error.log-20170820 :278

On a standard app server is occurs less often (e.g. mw1269):

error.log-20170523 :4
error.log-20170524 :2
error.log-20170525 :2
error.log-20170526 :0
error.log-20170527 :2
error.log-20170528 :0
error.log-20170529 :0
error.log-20170530 :0
error.log-20170531 :0
error.log-20170601 :0
error.log-20170602 :2
error.log-20170603 :0
error.log-20170604 :0
error.log-20170605 :2
error.log-20170606 :0
error.log-20170607 :0
error.log-20170608 :6
error.log-20170609 :6
error.log-20170610 :0
error.log-20170611 :2
error.log-20170612 :0
error.log-20170613 :2
error.log-20170614 :42
error.log-20170615 :2
error.log-20170616 :2
error.log-20170617 :4
error.log-20170618 :0
error.log-20170619 :2
error.log-20170620 :2
error.log-20170621 :2
error.log-20170622 :0
error.log-20170623 :4
error.log-20170624 :4
error.log-20170625 :10
error.log-20170626 :10
error.log-20170627 :8
error.log-20170628 :4
error.log-20170629 :4
error.log-20170630 :8
error.log-20170701 :6
error.log-20170702 :2
error.log-20170703 :8
error.log-20170704 :6
error.log-20170705 :10
error.log-20170706 :18
error.log-20170707 :22
error.log-20170708 :40
error.log-20170709 :42
error.log-20170710 :52
error.log-20170711 :36
error.log-20170712 :50
error.log-20170713 :52
error.log-20170714 :34
error.log-20170715 :64
error.log-20170716 :66
error.log-20170717 :66
error.log-20170718 :54
error.log-20170719 :74
error.log-20170720 :60
error.log-20170721 :42
error.log-20170722 :54
error.log-20170723 :68
error.log-20170724 :56
error.log-20170725 :62
error.log-20170726 :64
error.log-20170727 :50
error.log-20170728 :82
error.log-20170729 :42
error.log-20170730 :14
error.log-20170731 :14
error.log-20170801 :30
error.log-20170802 :20
error.log-20170803 :6
error.log-20170804 :28
error.log-20170805 :12
error.log-20170806 :12
error.log-20170807 :20
error.log-20170808 :14
error.log-20170809 :10
error.log-20170810 :10
error.log-20170811 :12
error.log-20170812 :10
error.log-20170813 :12
error.log-20170814 :20
error.log-20170815 :20
error.log-20170816 :24
error.log-20170817 :22
error.log-20170818 :22
error.log-20170819 :26
error.log-20170820 :18

Event Timeline

This may be the same problem as T171392, which has been noted to result in "unknown exception" errors from HHVM. The patch for that bug (76e84646) is, as far as I know, not included in the packages currently being deployed for T171166.

Now that 2.0.13 is rolled out to address the wikidata problems; are there other changes planned for luasandbox ATM? Otherwise I could cherrypick the patch into a local build, install it on the canary app servers and we keep an eye on it for a week. If that errors disappears, we can cut a 2.0.14 release and roll that out cluster-wide?

There are no more planned changes right now. Hopefully no new bugs crop up for a while. ;)

I prepared a build with the cherrypicked fix and installed it on mw1261 (one of our canary app servers). So far everything seems fine.

I'll leave it running until next week a while to see whether the exception error varnishes with that.

MoritzMuehlenhoff changed the visibility from "Custom Policy" to "All Users".

That seems to have fixed it:

for i in `ls error.log-2017*`; do echo -n $i ":"  ; grep "Fatal error: unknown exception" $i | wc -l  ; done
(..)
error.log-20170818 :28
error.log-20170819 :34
error.log-20170820 :14
error.log-20170821 :24
error.log-20170822 :36
error.log-20170823 :28
error.log-20170824 :42
error.log-20170825 :40
error.log-20170826 :22
error.log-20170827 :28
error.log-20170828 :14
error.log-20170829 :24
error.log-20170830 :24
error.log-20170831 :0
error.log-20170901 :0
error.log-20170902 :0
error.log-20170903 :0

Let's cut a 2.0.14 release to minimise confusion. When that's available I'll deploy it on the production cluster.

Legoktm changed the visibility from "All Users" to "Public (No Login Required)".Sep 4 2017, 8:03 AM

Mentioned in SAL (#wikimedia-operations) [2017-09-05T08:47:19Z] <moritzm> uploaded php-luasandbox/hhvm-luasandbox 2.0.14 to apt.wikimedia.org (T173705)

Mentioned in SAL (#wikimedia-operations) [2017-09-05T08:49:16Z] <moritzm> upgrading canary app servers to hhvm-luasandbox 2.0.14 (T173705)

I'm seeing failures in the Scribunto test suite using php-luasandbox 2.0.14 (my package for PHP7, not Wikimedia's HHVM one), but not under 2.0.13:

1) LuaSandbox: SandboxTests[1]: setfenv invalid level
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-ERROR: bad argument #1 to 'old_getfenv' (invalid level)
+ERROR: 'setfenv' cannot set the requested environment, it is protected

/home/km/projects/gerrit/mediawiki/extensions/Scribunto/tests/phpunit/engines/LuaCommon/LuaEngineTestBase.php:257
/home/km/projects/gerrit/mediawiki/core/tests/phpunit/MediaWikiTestCase.php:415
/home/km/projects/gerrit/mediawiki/core/maintenance/doMaintenance.php:92

2) LuaSandbox: SandboxTests[2]: getfenv invalid level
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-ERROR: bad argument #1 to 'old_getfenv' (invalid level)
+{
+}

And on CI which is using the HHVM package: https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-jessie/16036/console

At least locally downgrading to 2.0.13 makes all the tests pass again.

Mentioned in SAL (#wikimedia-operations) [2017-09-06T06:46:31Z] <moritzm> installing php-luasandbox 2.0.14 on API canaries along with HHVM restart (T173705)

Mentioned in SAL (#wikimedia-operations) [2017-09-13T07:47:16Z] <moritzm> rolling out hhvm-luasandbox 2.0.14 to the remaining hosts in eqiad (along with HHVM restarts) (T173705)

2.0.14 with a patch for this has been rolled out across the fleet, will check in a few days whether there's any further occurences in the logs.

That's fixed with 2.0.14.