Build / migrate to HHVM 3.18
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	MoritzMuehlenhoff
	Feb 15 2017, 12:36 PM

Description

HHVM 3.18 is the next LTS and now available in jessie-wikimedia/experimental

There were some stat_cache related bugs, which are fixed now. 3.18.2 is running on the mediawiki canaries, but the wider rollout is held back until after the DC switchover.

Details

	Subject	Repo	Branch	Lines +/-
	hhvm.server.stat_cache = false	operations/puppet	production	+1 -1

Customize query in gerrit

Related Objects
Search...

Status	Subtype	Assigned	Task
Resolved		MoritzMuehlenhoff	T158176 Build / migrate to HHVM 3.18
Resolved	PRODUCTION ERROR	Krinkle	T161095 Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php
Resolved		MoritzMuehlenhoff	T161520 HHVM 3.18 crashes when Cirrus tries to fetch another wiki config via maint script
Resolved		MoritzMuehlenhoff	T161684 HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() )
Resolved		MoritzMuehlenhoff	T162586 HHVM segfault in memory cleanup
Resolved		MoritzMuehlenhoff	T165043 HHVM 3.18 crashes in realloc() as exposed by luasandbox
Resolved		MoritzMuehlenhoff	T165051 HHVM 3.18 segfault on jobrunner / string handling
Resolved		hashar	T165074 CI tests failing with segfault
Resolved		MoritzMuehlenhoff	T165669 HHVM: Crash in server worker
Resolved		hashar	T167493 Switch CI tests back to HHVM 3.18
			Restricted Task

Event Timeline

MoritzMuehlenhoff created this task.Feb 15 2017, 12:36 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 15 2017, 12:36 PM

elukey subscribed.Feb 15 2017, 12:57 PM

Ricordisamoa subscribed.Feb 16 2017, 9:05 PM

Paladox subscribed.Feb 18 2017, 7:13 PM

HHVM 3.18 now requires webscalesqlclient also for the standard mysql extension, so I've changed the tarball generation to no longer prune webscalesqlclient. DENABLE_ASYNC_MYSQL is now broken in general as well, if not set to ON it misses to link libsquangle as well.

HHVM 3.18 also needs a build on librb5-dev now.

This also needed an updated build of lz4 since the extension in 3.18 requires the LZ4_compress_default() function only introduced in lz4 r130.

With these additional changes, 3.18 now builds, but fails the hphp/test/quick/json_bigint.php test introduced in 626ecb4c38e8acbb52e43d043aeb26069c873bb3

I've temporarily backed it out to produce a build for further debugging, this might be related to our build using json-c (or a difference in versions)

3.18.1 packages are now built. Next I'll rebuild 3.18.1 in Debian unstable to narrow down whether the hphp/test/quick/json_bigint.ph test failure is related to the outdated json-c in jessie compared to what's in stretch/sid.

When building on Debian stretch with json-c 0.12.1 the same error occurs (and also an additonal test failure), will open a bug for that.

Reported upstream at https://github.com/facebook/hhvm/issues/7708

The test failure is benign; it tests a new feature introduced into the simple JSON parser unconditionally of whether json-c is used or not. Since Debian uses json-c for license reasons, this doesn't affect us and backing out the test case is the correct interim fix (until upstream has fixed the test suite to only apply it for builds without DUSE_JSONC=ON).

HHVM extensions need to be rebuilt for the new 3.18 ABI.

hhvm-wikidiff2 and hhvm-luasandbox rebuilt without changes against the new 3.18 API. hhvm-tidy needed to be patched. Initially the build failed with

/home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp: In member function ‘virtual void HPHP::TidyExtension::moduleInit()’:
/home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:86:34: error: invalid conversion from ‘void* (*)(size_t, HPHP::type_scan::Index) {aka void* (*)(long unsigned int,
short unsigned int)}’ to ‘TidyMalloc {aka void* (*)(long unsigned int)}’ [-fpermissive]
     tidySetMallocCall(req::malloc);
                                  ^
In file included from /home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:20:0:
/usr/include/tidy/tidy.h:245:28: note: initializing argument 1 of ‘Bool tidySetMallocCall(TidyMalloc)’
 TIDY_EXPORT Bool TIDY_CALL tidySetMallocCall( TidyMalloc fmalloc );
                            ^
/home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:87:36: error: invalid conversion from ‘void* (*)(void*, size_t, HPHP::type_scan::Index) {aka void* (*)(void*, long
unsigned int, short unsigned int)}’ to ‘TidyRealloc {aka void* (*)(void*, long unsigned int)}’ [-fpermissive]
     tidySetReallocCall(req::realloc);
                                    ^
In file included from /home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:20:0:
/usr/include/tidy/tidy.h:247:28: note: initializing argument 1 of ‘Bool tidySetReallocCall(TidyRealloc)’
 TIDY_EXPORT Bool TIDY_CALL tidySetReallocCall( TidyRealloc frealloc );
                            ^
CMakeFiles/tidy.dir/build.make:57: recipe for target 'CMakeFiles/tidy.dir/ext_tidy.cpp.o' failed

This is due to internal changes in HHVM required by the garbage collection that got implemented. In 3.12 the req::malloc API looks like this:

void* malloc(size_t nbytes);
void* calloc(size_t count, size_t bytes);
void* realloc(void* ptr, size_t nbytes);

While in 3.18 it provides additional type information for the GC:

void* malloc(size_t nbytes,
             type_scan::Index tyindex = type_scan::kIndexUnknown);

void* calloc(size_t count, size_t bytes,
             type_scan::Index tyindex = type_scan::kIndexUnknown);

void* realloc(void* ptr,
              size_t nbytes,
              type_scan::Index tyindex = type_scan::kIndexUnknown);

Fortunately HHVM provides new functions for memory allocation with unknown index type, I switched the extension to them with the patch below: This fixes the build.

--- hhvm-tidy-0.1.3~jessie1/ext_tidy.cpp        2016-04-05 07:19:49.000000000 +0000
+++ hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp        2017-03-07 13:00:14.308416908 +0000
@@ -83,8 +83,8 @@
   virtual void moduleInit() {
     HHVM_FE(tidy_repair_string);
     loadSystemlib();
-    tidySetMallocCall(req::malloc);
-    tidySetReallocCall(req::realloc);
+    tidySetMallocCall(req::malloc_noptrs);
+    tidySetReallocCall(req::realloc_noptrs);
     tidySetFreeCall(req::free);
   }
 } s_tidy_extension;

Mentioned in SAL (#wikimedia-operations) [2017-03-07T14:32:31Z] <moritzm> uploaded HHVM 3.18 builds of hhvm-tidy, hhvm-luasandbox and hhvm-wikidiff2 to the experimental section of apt.wikimedia.org (Bug: T158176)

Just relaying that translatewiki.net has been affected with https://github.com/facebook/hhvm/issues/7567 until I disabled stat_cache. As far as I know WMF has stat_cache enabled.

Looks like we're getting a few fatals on beta cluster too...

Looks pretty much the same what @Nikerabbit is seeing

root@deployment-mediawiki04:/var/log/hhvm# cat /var/log/hhvm/stacktrace.16002.log
Host: deployment-mediawiki04
ProcessID: 16002
ThreadID: 139719568520960
ThreadPID: 16116
Name: unknown program
Type: Aborted
Runtime: hhvm
Version: 3.18.1+dfsg-1
DebuggerCount: 0

Assertion Failure: /tmp/buildd/hhvm-3.18.1+dfsg/hphp/runtime/base/program-functions.cpp:2321: void HPHP::hphp_memory_cleanup(): assertion `g_context.isNull() || !g_context->hasRequestEventHandlers()' failed.

# 0  000000000219e496
# 1  0000000000f3dda2
# 2  00007f135a3b7890
# 3  00007f1354a2c067
# 4  00007f1354a2d448
# 5  0000000000d494a8
# 6  00000000021323b6
# 7  0000000000fd64e5
# 8  000000000219837d
# 9  000000000117f24d
# 10 000000000117f939
# 11 0000000000d51f2f
# 12 0000000000d520fb
# 13 0000000000f81926
# 14 00007f135a3b0064
# 15 00007f1354adf62d

PHP Stacktrace:

root@deployment-mediawiki04:/var/log/hhvm#

Change 341916 had a related patch set uploaded (by Reedy):
[operations/puppet] hhvm.server.stat_cache = false

https://gerrit.wikimedia.org/r/341916

gerritbot added a project: Patch-For-Review.Mar 8 2017, 10:30 PM

^ FWIW, not necessarily suggesting we merge that for production, but going to cherry pick onto the deployment puppetmaster for the time being to stop beta breaking

Unless things have changed, we had to enable StatCache or the performance just crawled down. Ori made it very early you can see for the details on T75706

In T158176#3088412, @hashar wrote:

Unless things have changed, we had to enable StatCache or the performance just crawled down. Ori made it very early you can see for the details on T75706

Was this before the automated restart stuff? Do we still have that?

The assertion failure definitely happens on the beta cluster. I haven't found a good way to reproduce. Looking at the wfDebugLog I managed to find some URL that probably triggered some, but trying the same URL apparently does not trigger it.

Some errors are logged in https://logstash-beta.wmflabs.org/app/kibana but the last one is from March 8th 23:00 ??

Yesterday log on beta for hphp_memory_cleanup _type:hhvm

hhvmassertionfailure.png (547×1 px, 82 KB)

Link to last 24 hours: https://logstash-beta.wmflabs.org/goto/2ccd8afa9af2422479fa44611b07b29a

The beta log spam stopped because StatCache has been disabled via cherry pick of https://gerrit.wikimedia.org/r/#/c/341916/ . That hardly made any change to the instances CPU usage then they are not having much traffic.

Maybe we can reenable it and enable core dump on beta? Might help figure out what is happening (unlikely but still).

hashar added a project: Upstream.Mar 9 2017, 5:45 PM

hashar updated the task description. (Show Details)

MoritzMuehlenhoff updated the task description. (Show Details)Mar 10 2017, 12:20 PM

In T158176#3074551, @MoritzMuehlenhoff wrote:

The test failure is benign; it tests a new feature introduced into the simple JSON parser unconditionally of whether json-c is used or not. Since Debian uses json-c for license reasons, this doesn't affect us and backing out the test case is the correct interim fix (until upstream has fixed the test suite to only apply it for builds without DUSE_JSONC=ON).

This is fixed upstream with https://github.com/facebook/hhvm/commit/f7a1dae3d90179c10c03e749875aa8a6a5015605, I've cherrypicked the patch to the Debian hhvm git repository (on which our internal packages are based)

\o/ "mofarrell commented an hour ago: A fix is on its way."

FIxed in hhvm master with https://github.com/facebook/hhvm/commit/5207b59eb88e2f9820efb74442245a4f5aa9eb17

I'll rebuild our 3.18.1 package with that patch.

In T158176#3097998, @MoritzMuehlenhoff wrote:

FIxed in hhvm master with https://github.com/facebook/hhvm/commit/5207b59eb88e2f9820efb74442245a4f5aa9eb17

I'll rebuild our 3.18.1 package with that patch.

(Y)

Mentioned in SAL (#wikimedia-operations) [2017-03-15T15:47:48Z] <moritzm> uploaded new HHVM 3.18 package with backported patch for stat_cache regression (T158176)

I guess we can upgrade hhvm on beta cluster and drop the revert patch from the beta cluster puppet master ( https://gerrit.wikimedia.org/r/#/c/341916/ ).

In T158176#3102910, @hashar wrote:

I guess we can upgrade hhvm on beta cluster and drop the revert patch from the beta cluster puppet master ( https://gerrit.wikimedia.org/r/#/c/341916/ ).

Both already done :)

The crash hasn't happened again over the course of the last ~ 16 hours in Beta (and previously we've hit it about 10 times per hour on mediawiki04.deployment-prep.eqiad,wmnet). I've reported success to the HHVM developers and also asked them to consider this for inclusion into the 3.18.2 release.

MoritzMuehlenhoff reopened this task as Open.Mar 16 2017, 8:40 AM

HHVM 3.18.1 has been upgraded on mwdebug* and mw1261 (but currently depooled for T161095)

MoritzMuehlenhoff added a subtask: T161520: HHVM 3.18 crashes when Cirrus tries to fetch another wiki config via maint script .Mar 27 2017, 2:49 PM

Change 341916 abandoned by Reedy:
hhvm.server.stat_cache = false

https://gerrit.wikimedia.org/r/341916

MoritzMuehlenhoff closed subtask T161520: HHVM 3.18 crashes when Cirrus tries to fetch another wiki config via maint script as Resolved.Apr 5 2017, 10:54 AM

MoritzMuehlenhoff added a subtask: T161684: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ).Apr 5 2017, 10:56 AM

The dead lock could be narrowed down to incorrect locking in stat_cache and was eventually fixed by upstream. I've built a new package which is now running on mw1261 (with stat_cache enabled) to validate whether the fix is sufficient.

MoritzMuehlenhoff updated the task description. (Show Details)Apr 10 2017, 9:24 AM

MoritzMuehlenhoff created subtask T162586: HHVM segfault in memory cleanup.Apr 10 2017, 9:27 AM

Krinkle closed subtask T161095: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php as Resolved.Apr 10 2017, 11:36 PM

MoritzMuehlenhoff closed subtask T161684: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) as Resolved.Apr 11 2017, 2:46 PM

Catrope mentioned this in T163278: Four different PHP/HHVM versions on the cluster.Apr 18 2017, 11:05 PM

MoritzMuehlenhoff created subtask T165043: HHVM 3.18 crashes in realloc() as exposed by luasandbox.May 11 2017, 2:15 PM

MoritzMuehlenhoff created subtask T165051: HHVM 3.18 segfault on jobrunner / string handling.May 11 2017, 3:20 PM

daniel added a subtask: T165074: CI tests failing with segfault.May 11 2017, 5:42 PM

hashar mentioned this in T165074: CI tests failing with segfault.May 11 2017, 8:40 PM

Status update: The app servers mw1170-mw1184, mw1261-mw1265 are now running HHVM 3.18. It's working mostly fine there, we're seeing them crash every 2-3 days for T162586, but systemd restarts that reliably and I'm optimistic that we'll have a fix next week. The Treadmill deadlock is gone entirely.

In addition an image scaler (mw1293) and an API server (mw1189) have been upgraded and they're stable as well.

Also, one job runner (mw1161) was upgraded, it segfaulted five times yesterday and once today. Crashes are different (T165043 and T165051), at least for the crash in luasandbox it seems useful to first upgrade to a fix for T162685. This might just as well be related. I've downgraded the job runner back to 3.12 until a new 3.18 package is available next week.

In addition the new HHVM package has caused test failures in CI tests (T165074). The test will be unbroken by the fixed packages for T162685 as well, until then I've copied the 3.12 packages to experimental and I'm working with Antoine to have the CI tests run 3.12 again until a fix for T162685 has landed and has been verified to make the CI tests pass again.

hashar closed subtask T165074: CI tests failing with segfault as Resolved.May 12 2017, 2:04 PM

MoritzMuehlenhoff created subtask T165669: HHVM: Crash in server worker.May 18 2017, 12:31 PM

MoritzMuehlenhoff closed subtask T162586: HHVM segfault in memory cleanup as Resolved.May 30 2017, 9:15 AM

MoritzMuehlenhoff closed subtask T165051: HHVM 3.18 segfault on jobrunner / string handling as Resolved.Jun 9 2017, 7:15 AM

MoritzMuehlenhoff closed subtask T165669: HHVM: Crash in server worker as Resolved.

MoritzMuehlenhoff created subtask T167493: Switch CI tests back to HHVM 3.18.Jun 9 2017, 7:19 AM

Status update: With HHVM 3.18.2+wmf5 we now have a stable HHVM package. It has been rolled out on all appservers, all image scalers, most API servers and about one quarter of the job runners in eqiad (codfw is complete).

MoritzMuehlenhoff closed subtask T165043: HHVM 3.18 crashes in realloc() as exposed by luasandbox as Resolved.Jun 13 2017, 8:27 AM

MoritzMuehlenhoff created subtask Restricted Task.Jun 13 2017, 8:49 AM

The migration to HHVM 3.18 is almost complete; the two remaining systems (terbium/wasat) will be updated tomorrow.

hashar closed subtask T167493: Switch CI tests back to HHVM 3.18 as Resolved.Jun 14 2017, 10:25 AM

3.18 might resolve T97192: HHVM request timeouts not working; support lowering the API request timeout per request as well. This needs to be verified.

All our jessie-based application servers (app servers, API servers, job runners and image scalers, script runners) are now upgraded to HHVM 3.18.2 (plus a few cherrypicked backports from master, I'll ask the HHVM developers to merge those into the next 3.18 LTS release so that people running mediawiki with the debs provided by HHVM have these fixed as well).

We found a number of bugs in HHVM (some of them were already present in 3.12), but they've all been fixed in the latest package.

Compared to before, HHVM is more stable than with 3.12; the deadlocks in HPHP::Treadmill are gone completely and over the course of the last 72 hours we only had a single segfault across all our 300 production servers (half of them being inactive in codfw, of course).

Krinkle mentioned this in T167885: Regression: Backend Save Timing raised by about 50%.Jun 15 2017, 10:23 PM

Krinkle closed subtask Restricted Task as Resolved.Oct 3 2019, 3:40 AM

Build / migrate to HHVM 3.18Closed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Build / migrate to HHVM 3.18
Closed, ResolvedPublic
Actions

Related Objects
Search...