HHVM: Unknown exception
Closed, ResolvedPublic

Description

From /var/log/hhvm/error.log, found this on mw1280 (API server) when looking for an unrelated change:

Aug 21 11:16:53 mw1280 hhvm: #012Fatal error: unknown exception
Aug 21 11:16:53 mw1280 hhvm[21493]: [Mon Aug 21 11:16:53 2017] [hphp] [21493:7f2ff6fff700:6208:000001] [] \nFatal error: unknown exception

Come in spikes, but generally occurs more often in July:

for i in `ls error.log-2017*`; do echo -n $i ":"  ; grep "Fatal error: unknown exception" $i | wc -l  ; done

error.log-20170523 :2
error.log-20170524 :2
error.log-20170525 :10
error.log-20170526 :2
error.log-20170527 :6
error.log-20170528 :2
error.log-20170529 :2
error.log-20170530 :2
error.log-20170531 :12
error.log-20170601 :6
error.log-20170602 :0
error.log-20170603 :14
error.log-20170604 :4
error.log-20170605 :6
error.log-20170606 :2
error.log-20170607 :0
error.log-20170608 :0
error.log-20170609 :4
error.log-20170610 :4
error.log-20170611 :0
error.log-20170612 :0
error.log-20170613 :2
error.log-20170614 :76
error.log-20170615 :8
error.log-20170616 :4
error.log-20170617 :2
error.log-20170618 :0
error.log-20170619 :6
error.log-20170620 :8
error.log-20170621 :6
error.log-20170622 :12
error.log-20170623 :4
error.log-20170624 :12
error.log-20170625 :0
error.log-20170626 :12
error.log-20170627 :8
error.log-20170628 :8
error.log-20170629 :8
error.log-20170630 :24
error.log-20170701 :14
error.log-20170702 :8
error.log-20170703 :24
error.log-20170704 :18
error.log-20170705 :18
error.log-20170706 :76
error.log-20170707 :126
error.log-20170708 :192
error.log-20170709 :182
error.log-20170710 :108
error.log-20170711 :104
error.log-20170712 :104
error.log-20170713 :210
error.log-20170714 :88
error.log-20170715 :194
error.log-20170716 :218
error.log-20170717 :208
error.log-20170718 :114
error.log-20170719 :186
error.log-20170720 :178
error.log-20170721 :102
error.log-20170722 :162
error.log-20170723 :286
error.log-20170724 :164
error.log-20170725 :188
error.log-20170726 :186
error.log-20170727 :264
error.log-20170728 :200
error.log-20170729 :320
error.log-20170730 :40
error.log-20170731 :108
error.log-20170801 :98
error.log-20170802 :120
error.log-20170803 :90
error.log-20170804 :68
error.log-20170805 :42
error.log-20170806 :46
error.log-20170807 :28
error.log-20170808 :30
error.log-20170809 :60
error.log-20170810 :42
error.log-20170811 :44
error.log-20170812 :42
error.log-20170813 :66
error.log-20170814 :44
error.log-20170815 :120
error.log-20170816 :180
error.log-20170817 :228
error.log-20170818 :242
error.log-20170819 :280
error.log-20170820 :278

On a standard app server is occurs less often (e.g. mw1269):

error.log-20170523 :4
error.log-20170524 :2
error.log-20170525 :2
error.log-20170526 :0
error.log-20170527 :2
error.log-20170528 :0
error.log-20170529 :0
error.log-20170530 :0
error.log-20170531 :0
error.log-20170601 :0
error.log-20170602 :2
error.log-20170603 :0
error.log-20170604 :0
error.log-20170605 :2
error.log-20170606 :0
error.log-20170607 :0
error.log-20170608 :6
error.log-20170609 :6
error.log-20170610 :0
error.log-20170611 :2
error.log-20170612 :0
error.log-20170613 :2
error.log-20170614 :42
error.log-20170615 :2
error.log-20170616 :2
error.log-20170617 :4
error.log-20170618 :0
error.log-20170619 :2
error.log-20170620 :2
error.log-20170621 :2
error.log-20170622 :0
error.log-20170623 :4
error.log-20170624 :4
error.log-20170625 :10
error.log-20170626 :10
error.log-20170627 :8
error.log-20170628 :4
error.log-20170629 :4
error.log-20170630 :8
error.log-20170701 :6
error.log-20170702 :2
error.log-20170703 :8
error.log-20170704 :6
error.log-20170705 :10
error.log-20170706 :18
error.log-20170707 :22
error.log-20170708 :40
error.log-20170709 :42
error.log-20170710 :52
error.log-20170711 :36
error.log-20170712 :50
error.log-20170713 :52
error.log-20170714 :34
error.log-20170715 :64
error.log-20170716 :66
error.log-20170717 :66
error.log-20170718 :54
error.log-20170719 :74
error.log-20170720 :60
error.log-20170721 :42
error.log-20170722 :54
error.log-20170723 :68
error.log-20170724 :56
error.log-20170725 :62
error.log-20170726 :64
error.log-20170727 :50
error.log-20170728 :82
error.log-20170729 :42
error.log-20170730 :14
error.log-20170731 :14
error.log-20170801 :30
error.log-20170802 :20
error.log-20170803 :6
error.log-20170804 :28
error.log-20170805 :12
error.log-20170806 :12
error.log-20170807 :20
error.log-20170808 :14
error.log-20170809 :10
error.log-20170810 :10
error.log-20170811 :12
error.log-20170812 :10
error.log-20170813 :12
error.log-20170814 :20
error.log-20170815 :20
error.log-20170816 :24
error.log-20170817 :22
error.log-20170818 :22
error.log-20170819 :26
error.log-20170820 :18
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 21 2017, 11:34 AM
MoritzMuehlenhoff triaged this task as Normal priority.Aug 21 2017, 11:34 AM
Aklapper removed subscribers: Operations, HHVM.
Anomie added a subscriber: Anomie.Aug 22 2017, 2:23 PM

This may be the same problem as T171392, which has been noted to result in "unknown exception" errors from HHVM. The patch for that bug (76e84646) is, as far as I know, not included in the packages currently being deployed for T171166.

Now that 2.0.13 is rolled out to address the wikidata problems; are there other changes planned for luasandbox ATM? Otherwise I could cherrypick the patch into a local build, install it on the canary app servers and we keep an eye on it for a week. If that errors disappears, we can cut a 2.0.14 release and roll that out cluster-wide?

There are no more planned changes right now. Hopefully no new bugs crop up for a while. ;)

I prepared a build with the cherrypicked fix and installed it on mw1261 (one of our canary app servers). So far everything seems fine.

I'll leave it running until next week a while to see whether the exception error varnishes with that.

MoritzMuehlenhoff changed the visibility from "Custom Policy" to "All Users".

That seems to have fixed it:

for i in `ls error.log-2017*`; do echo -n $i ":"  ; grep "Fatal error: unknown exception" $i | wc -l  ; done
(..)
error.log-20170818 :28
error.log-20170819 :34
error.log-20170820 :14
error.log-20170821 :24
error.log-20170822 :36
error.log-20170823 :28
error.log-20170824 :42
error.log-20170825 :40
error.log-20170826 :22
error.log-20170827 :28
error.log-20170828 :14
error.log-20170829 :24
error.log-20170830 :24
error.log-20170831 :0
error.log-20170901 :0
error.log-20170902 :0
error.log-20170903 :0

Let's cut a 2.0.14 release to minimise confusion. When that's available I'll deploy it on the production cluster.

Legoktm changed the visibility from "All Users" to "Public (No Login Required)".Mon, Sep 4, 8:03 AM
Legoktm added a subscriber: Legoktm.Mon, Sep 4, 8:45 AM

2.0.14 has been tagged.

Mentioned in SAL (#wikimedia-operations) [2017-09-05T08:47:19Z] <moritzm> uploaded php-luasandbox/hhvm-luasandbox 2.0.14 to apt.wikimedia.org (T173705)

Mentioned in SAL (#wikimedia-operations) [2017-09-05T08:49:16Z] <moritzm> upgrading canary app servers to hhvm-luasandbox 2.0.14 (T173705)

I'm seeing failures in the Scribunto test suite using php-luasandbox 2.0.14 (my package for PHP7, not Wikimedia's HHVM one), but not under 2.0.13:

1) LuaSandbox: SandboxTests[1]: setfenv invalid level
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-ERROR: bad argument #1 to 'old_getfenv' (invalid level)
+ERROR: 'setfenv' cannot set the requested environment, it is protected

/home/km/projects/gerrit/mediawiki/extensions/Scribunto/tests/phpunit/engines/LuaCommon/LuaEngineTestBase.php:257
/home/km/projects/gerrit/mediawiki/core/tests/phpunit/MediaWikiTestCase.php:415
/home/km/projects/gerrit/mediawiki/core/maintenance/doMaintenance.php:92

2) LuaSandbox: SandboxTests[2]: getfenv invalid level
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-ERROR: bad argument #1 to 'old_getfenv' (invalid level)
+{
+}

And on CI which is using the HHVM package: https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-jessie/16036/console

At least locally downgrading to 2.0.13 makes all the tests pass again.

Anomie added a comment.Tue, Sep 5, 7:42 PM

I'm working on it.

Tests fixed by @Anomie :)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T06:46:31Z] <moritzm> installing php-luasandbox 2.0.14 on API canaries along with HHVM restart (T173705)

Mentioned in SAL (#wikimedia-operations) [2017-09-13T07:47:16Z] <moritzm> rolling out hhvm-luasandbox 2.0.14 to the remaining hosts in eqiad (along with HHVM restarts) (T173705)

2.0.14 with a patch for this has been rolled out across the fleet, will check in a few days whether there's any further occurences in the logs.

MoritzMuehlenhoff closed this task as Resolved.Tue, Sep 19, 9:51 AM

That's fixed with 2.0.14.