Page MenuHomePhabricator

New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests
Closed, ResolvedPublic

Description

HHVM CI jobs running on Jessie seg fault with HHVM 3.12.11

Symptoms

See for example https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/5115/console. The tests seem to complete, but then the log reports a segfault and Jenkins marks the job as failed.

16:13:38 Time: 5.28 minutes, Memory: 763.68MB
16:13:38 
16:13:38 OK, but incomplete, skipped, or risky tests!
16:13:38 Tests: 14107, Assertions: 116135, Skipped: 119, Risky: 1.

16:13:43 /srv/deployment/integration/slave-scripts/bin/mw-run-phpunit-allexts.sh: line 23: 
         vvvvvvvvvvvvvvvvvv
    2306 Segmentation fault      (core dumped) php -dzend.enable_gc=0 phpunit.php --log-junit "$JUNIT_DEST" --testsuite extensions
         ^^^^^^^^^^^^^^^^^^
16:13:43 Build step 'Execute shell' marked build as failure
16:13:43 Recording test results
16:13:48 Archiving artifacts
16:13:51 [PostBuildScript] - Execution post build scripts.
16:13:51 [PostBuildScript] Build is not success : do not execute script

16:13:51 Finished: FAILURE

That also affects mediawiki-phpunit-hhvm-jessie, see for example https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm-jessie/2859/console.


Last working builds were shortly before 15:00 UTC.

14:12	<moritzm>	uploaded hhvm 3.12.12 to carbon

Nodepool start refreshing the images at 14:14UTC which takes roughly 10 minutes. Once complete, newly created instances use that new image.

Reproduction

In short: /usr/bin/hhvm -v Eval.Jit=false tests/phpunit/phpunit.php tests/phpunit/includes/import/

On a Jessie instance setup a basic test env:

# MediaWiki + vendor + installation
git clone --depth 1 https://gerrit.wikimedia.org/r/p/mediawiki/core.git mediawiki-core
cd mediawiki-core
git clone --depth 1 https://gerrit.wikimedia.org/r/p/mediawiki/vendor.git
mkdir data
php maintenance/install.php --dbtype sqlite --dbpath=$(pwd)/data --pass XXXXXX bugwiki wikiadmin

# Get composer + dev dependencies
git clone  --single-branch https://gerrit.wikimedia.org/r/p/integration/composer
./composer/vendor/bin/composer install

Run under GDB and exercise includes/import/

gdb --args /usr/bin/hhvm -v Eval.Jit=false tests/phpunit/phpunit.php tests/phpunit/includes/import/
(gdb) run

Starting program: /usr/bin/hhvm -v Eval.Jit=false tests/phpunit/phpunit.php tests/phpunit/includes/import/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffe53ff700 (LWP 29491)]
[Thread 0x7fffe53ff700 (LWP 29491) exited]
Using HHVM 3.12.11 (5.6.99-hhvm)
PHPUnit 4.8.31 by Sebastian Bergmann and contributors.

........

Time: 3 seconds, Memory: 16.61MB

OK (8 tests, 15 assertions)

Program received signal SIGSEGV, Segmentation fault.
HPHP::UserFile::close (this=0x7fffde1b6690)
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/user-file.cpp:202
202     /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/user-file.cpp: No such file or directory.

(gdb)  bt
#0  HPHP::UserFile::close (this=0x7fffde1b6690)
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/user-file.cpp:202
#1  0x000000000121a8c2 in HPHP::XMLReader::close (this=0x7fffde0e5e20)
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/ext/xmlreader/ext_xmlreader.cpp:95
#2  0x000000000198aa7f in HPHP::MemoryManager::sweep (this=0x7ffff7fbe830, 
    this@entry=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>)
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/memory-manager.cpp:482
#3  0x0000000001989ead in HPHP::hphp_memory_cleanup ()
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/program-functions.cpp:2185
#4  0x00000000009eee57 in HPHP::hphp_session_exit ()
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/program-functions.cpp:2219
#5  0x00000000009f5fe7 in HPHP::execute_command_line_end (xhprof=0, coverage=coverage@entry=true, 
    program=0x7fffdf6f4b58 "tests/phpunit/phpunit.php")
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/program-functions.cpp:707
#6  0x00000000009fe9b7 in HPHP::execute_program_impl (argc=argc@entry=5, argv=argv@entry=0x7fffffffe5c8)
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/program-functions.cpp:1722
#7  0x0000000000a0067e in HPHP::execute_program (argc=argc@entry=5, argv=argv@entry=0x7fffffffe5c8)
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/program-functions.cpp:1040
#8  0x00000000008dd1f5 in main (argc=<optimized out>, argv=0x7fffffffe5c8)
    at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/hhvm/main.cpp:60

(gdb) up
#1  0x000000000121a8c2 in HPHP::XMLReader::close (this=0x7fffde0e5e20)
(gdb)

Event Timeline

hashar renamed this task from Jenkins is failing mediawiki-extensions-hhvm-jessie due to segfault to MediaWiki tests causes HHVM to segfault on Jessie.Feb 1 2017, 5:48 PM
hashar added projects: HHVM, SRE.
hashar updated the task description. (Show Details)
hashar subscribed.

I am not available for the next two hours. Hopefully back at 8pm UTC

Updated the task details but I guess it is a new HHVM version that causes the segfault:

Last working builds were shortly before 15:00 UTC.

14:12	<moritzm>	uploaded hhvm 3.12.12 to carbon

Nodepool start refreshing the images at 14:14UTC which takes roughly 10 minutes. Once complete, newly created instances use that new image.

MediaWiki install.php dumps the HHVM version in use, so one can compare the console output between the last successful build and the first to fail. If the diff shows the HHVM version changed, that would confirm the sequence above and HHVM is to blame.

Luckily Nodepool keeps the previous version of the snapshot.

idhostnameimage idAge
1318ci-jessie-wikimedia-1485872040f8834655-f8ab-46aa-8dd2-9933ed55005a27h ago
1320ci-jessie-wikimedia-1485958440d8418f09-659e-4c4c-8d41-275d974fc5173h30 ago

To spawn an instance, Nodepool pick the youngest one. I guess to rollback we can just delete 1320 then 'nodepool delete' all Jessie instances so new instances spawn with the 1318 image.

If HHVM is to blame, that will fail again magically tomorrow after 14:14UTC when the nodepool image get refreshed again.

NOTE: it is possible that the segfault is not due to HHVM new version. But that is highly suspicious.

To spawn an instance, Nodepool pick the youngest one. I guess to rollback we can just delete 1320 then 'nodepool delete' all Jessie instances so new instances spawn with the 1318 image.

I have delete nodepool image 1320 (nodepool image-delete 1320) and all instances that were "building" or "ready" right after I did that. Hopefully that means we'll be using the old hhvm for future builds.

Mentioned in SAL (#wikimedia-releng) [2017-02-01T18:51:22Z] <thcipriani> nodepool delete-image 1320 per T156923

https://gerrit.wikimedia.org/r/#/c/323401/ had two builds run

Failed https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm-jessie/2863/
Success at 18:51:04 UTC https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm-jessie/2864/

So the rollback fixed the immediate issue.

Doing a diff between the full console logs show:

FAILURE: hhvm 3.12.11
SUCCESS: hhvm 3.12.7

Which mean there is something wrong with the new HHVM version. That needs to be tracked down.

Meanwhile, we need the HHVM Debian package to be rollbacked from apt.wikimedia.org .

hashar renamed this task from MediaWiki tests causes HHVM to segfault on Jessie to New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests.Feb 1 2017, 8:05 PM

I managed to reproduce locally with HHVM 3.12.11+dfsg-1+wmf1 from apt.wikimedia.org and its lib dependencies from jessie-backports.

Using HEAD of mediawiki/core and dependencies installed via composer update:

$ cd tests/phpunit
hhvm -v Eval.Jit=false phpunit.php --log-junit /tmp/junit-mw-phpunit.xml ''
...
Core dumped: Segmentation fault
Stack trace in /tmp/stacktrace.25825.log
Segmentation fault

Trace file is empty though:

stacktrace.25825.log
PHP Stacktrace:

Again when trying to get HHVM to generate a core dump, but nothing get generated for some reason. Will look at it post lunch:

hhvm -v Eval.Jit=false -v Debug.CoreDumpReport=true -v Debug.CoreDumpReportDirectory=/tmp phpunit.php --log-junit /tmp/junit-mw-phpunit.xml ''

Bash limits the size of core files to 0 which means no core get generated. Gotta set it to unlimited.

--log-junit /tmp/junit-mw-phpunit.xml is not needed.

$ ulimit -c unlimited
$ cd tests/phpunit
$ hhvm -v Eval.Jit=false phpunit.php
...

Then for the debug fun with hhvm-gdb:

$ hhvm-gdb /usr/bin/hhvm core
Reading symbols from /usr/bin/hhvm...Reading symbols from /usr/lib/debug/.build-id/5b/bfa502149e67a16ff216ed42e36620fe8a36a4.debug...done.
done.
[New LWP 1014]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `hhvm -v Eval.Jit=false phpunit.php'.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  setIsClosed (closed=true, this=0x7f7fdf5e7f10) at /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/file.h:277
277     /tmp/buildd/hhvm-3.12.11+dfsg/hphp/runtime/base/file.h: No such file or directory.
(gdb) walkstk 
#0  {inline frame} @ 0x00999cca: setIsClosed() at hphp/.../base/file.h:277                                                                                                                      
#1  0x7fff574f9a30 @ 0x00999cca: HPHP::UserFile::close() at hphp/.../base/user-file.cpp:202                                                                                                     
#2  0x7fff574f9a60 @ 0x0121a8c2: HPHP::XMLReader::close() at hphp/.../xmlreader/ext_xmlreader.cpp:95                                                                                            
#3  0x7fff574f9a80 @ 0x0198aa7f: HPHP::MemoryManager::sweep() at hphp/.../base/memory-manager.cpp:482                                                                                           
#4  0x7fff574f9ac0 @ 0x01989ead: HPHP::hphp_memory_cleanup() at hphp/.../base/program-functions.cpp:2185                                                                                        
#5  0x7fff574f9b20 @ 0x009eee57: HPHP::hphp_session_exit() at hphp/.../base/program-functions.cpp:2219                                                                                          
#6  0x7fff574f9b90 @ 0x009f5fe7: HPHP::execute_command_line_end() at hphp/.../base/program-functions.cpp:707
#7  0x7fff574fb740 @ 0x009fe9b7: HPHP::execute_program_impl() at hphp/.../base/program-functions.cpp:1722
#8  0x7fff574fb790 @ 0x00a0067e: HPHP::execute_program() at hphp/.../base/program-functions.cpp:1040
#9  0x7fff574fb840 @ 0x008dd1f5: main() at hphp/.../hhvm/main.cpp:60

Installed on mwdebug1001 under /home/hashar/mediawiki-core

git clone --depth 1 --single-branch https://gerrit.wikimedia.org/r/p/mediawiki/core.git mediawiki-core
cd mediawiki-core
git clone --depth 1 --single-branch https://gerrit.wikimedia.org/r/p/mediawiki/vendor.git
mkdir data
php maintenance/install.php --dbtype sqlite --dbpath=$(pwd)/data --pass XXXXXX bugwiki wikiadmin

hhvm tests/phpunit/phpunit.php 
>> PHPUnit not found. Please install it and other dev dependencies by
>>                running `composer install` in MediaWiki root directory.

It fails since mediawiki/vendor does not ship development dependencies. And downloading them from the internet/Packagist.org on a prod machine is a NO/NO.

On deployment-tin additionally did:

git clone --depth 1 --single-branch https://gerrit.wikimedia.org/r/p/integration/composer
./composer/vendor/bin/composer install

Deployment-tin does not segfault. I am thus upgrading HHVM: Unpacking hhvm-dbg (3.12.11+dfsg-1+wmf1) over (3.12.7+dfsg-1+wmf1) ...

segfault!

Fast repro is:

hhvm -v Eval.Jit=false tests/phpunit/phpunit.php tests/phpunit/includes/import/

That exercise the code in includes/import/WikiImporter.php which relies on XMLReader.

I think I've identified the problem, a new package is building on copper ATM.

Mentioned in SAL (#wikimedia-releng) [2017-02-02T14:15:10Z] <hashar> Nodepool: delete the image building of Jessie (image id 1322) to prevent a faulty HHVM version from being added. T156923

I backed out the bzip2-segfault-sweep.patch introduced in 3.12.11+dfsg-1 and that fixes it. Will upload new packages to apt.wikimedia.org tomorrow.

MoritzMuehlenhoff claimed this task.

A new HHVM package has been uploaded to apt.wikimedia.org

Mentioned in SAL (#wikimedia-releng) [2017-02-03T09:54:22Z] <hashar> Regenerate Nodepool Jessie snapshot. Would get a new HHVM version T156923

Mentioned in SAL (#wikimedia-releng) [2017-02-03T10:12:04Z] <hashar> Image ci-jessie-wikimedia-1486115643 in wmflabs-eqiad is ready T156923

Reopening since it might still fails on some CI jobs and/or on the beta cluster. At least mediawiki/core tests no more segfault.

MoritzMuehlenhoff lowered the priority of this task from Unbreak Now! to High.Feb 3 2017, 10:29 AM

Mentioned in SAL (#wikimedia-releng) [2017-02-03T10:42:26Z] <hashar> Image ci-jessie-wikimedia-1486115643 in wmflabs-eqiad is ready T156923

There were mentions of Scribunto errors. I did a check experimental on a Scribunto change ( https://gerrit.wikimedia.org/r/#/c/43176/14 ). That triggers the extensions tests on a Nodepool Jessie (HHVM 3.12.11). It passed just fine https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-jessie/181/console

So definitely all fine CI wise.

The Scribunto error is on production canaries and got reported at T157110